Abstract. In this paper, we study a new type of clustering problem, called Chromatic Clustering, in high
dimensional space. Chromatic clustering seeks to partition a set of colored points into groups (or clusters)
so that no group contains two points with the same color and a certain objective function is optimized.
We consider two variants of the problem, chromatic k-means clustering (denoted as k-CMeans) and
chromatic k-medians clustering (denoted as k-CMedians), and investigate their hardness and
approximation solutions. For k-CMeans, we show that the additional coloring constraint destroys several
key properties (such as the locality property) used in existing k-means techniques (for ordinary points),
and significantly complicates the problem. There is no FPTAS for the chromatic clustering problem, even if
k = 2. To overcome the additional difficulty, we develop a standalone result, called the Simplex Lemma, which
enables us to efficiently approximate the mean point of an unknown point set through a fixed dimensional
simplex. A nice feature of the simplex is that it is independent of the dimensionality of the original space,
and thus can be used for problems in very high dimensional space. With the Simplex Lemma, together with
several random sampling techniques, we show that a $(1+\epsilon)$-approximation of k-CMeans can be achieved
in near linear time through a sphere peeling algorithm. For k-CMedians, we show that a similar sphere
peeling algorithm exists for achieving constant approximation solutions.



1 Introduction 



Clustering is one of the most fundamental problems in computer science and finds applications in
many different areas [2, 4, 6, 7, 9, 12, 14, 16]. Most existing clustering techniques assume that the
to-be-clustered data items are independent of each other. Thus each data item can "freely" determine
its membership within the resulting clusters, without paying attention to the clustering of other data
items. In recent years, considerable attention has also been paid to clustering dependent data, and a
number of clustering techniques, such as correlation clustering, point-set clustering, ensemble clustering,
and correlation connected clustering, have been developed [4, 7, 9-11].



In this paper, we consider a new type of clustering problem, called Chromatic Clustering, for
dependent data. Roughly speaking, a chromatic clustering problem takes as input a set of colored 
data items and groups them into clusters, according to certain objective functions, so that no pair of 
items with the same color are grouped together (such a requirement is called chromatic constraint). 
Chromatic clustering captures the mutual exclusiveness relationship among data items and is a rather 
useful model for various applications. Due to the additional chromatic constraint, chromatic clustering 
is thus expected to simultaneously solve the "coloring" and clustering problems, which significantly 
complicates the problem. As it will be shown later, the chromatic clustering problem is challenging 
to solve even for the case that each color is shared only by two data items. 

For chromatic clustering, we consider in this paper two variants, Chromatic k-means Clustering
(k-CMeans) and Chromatic k-medians Clustering (k-CMedians), in $\mathbb{R}^d$ space, where the dimensionality
could be very high and k is a fixed number. In both variants, the input is a set Q of n point-sets
$G_1, \dots, G_n$, each containing at most k points in d-dimensional space, and the objective is
to partition all points of Q into k different clusters so that the chromatic constraint is satisfied and
the total squared distance (i.e., k-CMeans) or total distance (i.e., k-CMedians) from each point to
the center point (i.e., mean or median point) of its cluster is minimized.

Motivation: The chromatic clustering problem is motivated by several interesting applications.
One of them is determining the topological structure of chromosomes in cell biology [10]. In such
applications, a set of 3D probing points (e.g., using BAC probes) is extracted from each homolog of
the chromosome of interest (see Figure 6 in Appendix), and the objective is to determine, for each
chromosome homolog, the common spatial distribution pattern of the probes among a population of
cells. For this purpose, the set of probes from each homolog is converted into a high dimensional feature
point in the feature space, where each dimension represents the distance between a particular pair of
probes. Since each chromosome has two (or more, as in cancer cells) homologs, each cell contributes
k (i.e., two or more) feature points. Due to technical limitations, it is impossible to identify the same
homolog across all cells. Thus, the k feature points from each cell form a point-set with the same color
(meaning that they are indistinguishable). To solve the problem, one could chromatically cluster all
point-sets into k clusters (after normalizing the cell size), with each corresponding to a homolog, and
use the mean or median point of each cluster as its common pattern.

Related works: As its generalization, chromatic clustering is naturally related to the traditional
clustering problem. Due to the additional chromatic constraint, chromatic clustering could behave
quite differently from its counterpart. For example, the k-means algorithms in [6, 15] rely on the
fact that all input points in a Voronoi cell of the optimal k mean points belong to the same cluster.
However, such a key locality property no longer holds for the k-CMeans problem.

Chromatic clustering falls under the umbrella of clustering with constraints. For this type of clustering,
several solutions exist for some variants [5]. Unfortunately, due to their heuristic nature, none of
them can yield quality guaranteed solutions for the chromatic clustering problem. The first quality
guaranteed solution for chromatic clustering was obtained recently by Ding and Xu. In [10], they
considered a special chromatic clustering problem, where every point-set has exactly k points in the
first quadrant and the objective is to cluster points by cones apexed at the origin, and presented
the first PTAS for constant k. The k-CMeans and k-CMedians problems considered in this paper
are the general cases of the chromatic clustering problem. Very recently, Arkin et al. [1] considered a
chromatic 2D 2-center clustering problem and presented both approximation and exact solutions.

1.1 Main Results and Techniques 

In this paper, we present three main results: a constant approximation and a $(1+\epsilon)$-approximation
for k-CMeans, and their extensions to k-CMedians.

— Constant approximation: We show that any c-approximation for k-means clustering
yields a $(2ck^2 + 2k - 1)$-approximation for k-CMeans. This not only provides a way for us
to generate an initial constant approximation solution for k-CMeans through some k-means
algorithm, but more importantly reveals the intrinsic connection between the two clustering problems.

— $(1+\epsilon)$-approximation: We show that a near linear time $(1+\epsilon)$-approximation solution for
k-CMeans can be obtained using an interesting sphere peeling algorithm. Due to the lack of the locality
property in k-CMeans, our sphere peeling algorithm is quite different from the ones used in [6, 15],
which in general do not guarantee a $(1+\epsilon)$-approximation solution for k-CMeans, as shown by our
first result. Our sphere peeling algorithm is based on another standalone result, called the Simplex
Lemma. The Simplex Lemma enables us to obtain an approximate mean point of a set of unknown
points through a grid inside a simplex determined by some partial knowledge of the unknown
point set. A unique feature of the Simplex Lemma is that the complexity of the grid is independent
of the dimensionality, and thus it can be used to solve problems in high dimensional space. With the
Simplex Lemma, our sphere peeling algorithm iteratively generates the mean points of k-CMeans,
with each iteration building a simplex for the mean point.

— Extensions to k-CMedians: We further extend the idea for k-CMeans to k-CMedians. In particular,
we show that any c-approximation for k-medians can be used to yield a
$((2+\epsilon)ck^2 + (2+\epsilon)k + 1)$-approximation for k-CMedians, where the $\epsilon$ error comes from the
difficulty of computing the optimal median point (i.e., the Fermat-Weber point). With this and a similar
sphere peeling technique, we obtain a $(5+\epsilon)$-approximation for k-CMedians. Note that although
$k \ge 2$ is a constant in this paper, a $(5+\epsilon)$-approximation is still much better than a
$((2+\epsilon)ck^2 + (2+\epsilon)k + 1)$-approximation.

Due to space limits, many details of our algorithms, proofs, and figures are deferred to the Appendix.

2 Preliminaries 

In this section, we introduce some definitions which will be used throughout the paper. 

Definition 1 (Chromatic Partition). Let $Q = \{G_1, \dots, G_n\}$ be a set of n point-sets with each
$G_i = \{p^i_1, \dots, p^i_{k_i}\}$ consisting of $k_i \le k$ points in $\mathbb{R}^d$ space. A chromatic partition of Q is a
partition of the $\sum_{1 \le i \le n} k_i$ points into k sets, $U_1, \dots, U_k$, such that each $U_i$ contains no more than
one point from each $G_j$ for $j = 1, 2, \dots, n$.

Definition 2 (Chromatic k-means Clustering (k-CMeans)). Let $Q = \{G_1, \dots, G_n\}$ be a set
of n point-sets with each $G_i = \{p^i_1, \dots, p^i_{k_i}\}$ consisting of $k_i \le k$ points in $\mathbb{R}^d$ space. The chromatic
k-means clustering (or k-CMeans) of Q is to find k points $\{m_1, \dots, m_k\}$ in $\mathbb{R}^d$ space and a chromatic
partition $U_1, \dots, U_k$ of Q such that $\frac{1}{n}\sum_{j=1}^{k}\sum_{q \in U_j} \|q - m_j\|^2$ is minimized. The problem is called full
k-CMeans if $k_1 = k_2 = \dots = k_n = k$.

For both k-CMedians and k-CMeans, a problem often encountered in our approach is "How to find
the best cluster for each point in $G_i$ if the k mean or median points $A = \{m_1, \dots, m_k\}$ are already
known?" An easy way to solve this problem is to first build a complete bipartite graph $(G_i \cup A, E_i)$ with
the points in $G_i$ and A as the two parts, and then compute a minimum weight bipartite matching as the
solution, where the edge weight is the Euclidean distance or squared distance between the two corresponding
vertices. Clearly, this can be done in a total of $O(k^3 dn)$ time for all $G_i$'s. (We call this procedure
bipartite matching.)
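The bipartite matching procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, squared Euclidean distance is assumed as the edge weight (the k-CMeans case), and the minimum weight matching is found by brute force over all injective assignments, which is only sensible for small constant k. A polynomial-time matching algorithm such as the Hungarian method would be needed to reach the stated $O(k^3 dn)$ bound.

```python
from itertools import permutations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_point_set(points, centers):
    """Match the points of one point-set G_i to distinct centers at minimum cost.

    Brute force over injective maps (assumption: k is small). Returns
    (cost, assignment), where assignment[i] is the center index of points[i].
    """
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(centers)), len(points)):
        cost = sum(sq_dist(p, centers[c]) for p, c in zip(points, perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return best_cost, list(best)


def chromatic_assignment(point_sets, centers):
    """Build k chromatic clusters from k given centers, one point-set at a time."""
    total = 0.0
    clusters = [[] for _ in centers]
    for G in point_sets:
        cost, assignment = assign_point_set(G, centers)
        total += cost
        for p, c in zip(G, assignment):
            clusters[c].append(p)
    return total, clusters
```

Because every point-set is matched injectively, no cluster ever receives two points of the same color, even when both points lie closest to the same center.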
3 Hardness of k-CMeans

It is easy to see that k-means is a special case of k-CMeans (i.e., each $G_i$ contains exactly one point).
As shown by Dasgupta [8], k-means in high dimensional space is NP-hard even if k = 2. Thus, we
immediately have the following theorem.

Theorem 1. k-CMeans is NP-hard for $k \ge 2$ in high dimensional space.

3.1 Is Full k-CMeans Easier?

It is interesting to know whether full k-CMeans is easier than general k-CMeans, since it is disjoint
from k-means when $k \ge 2$. The following theorem gives a negative answer to this question.

Theorem 2. Full k-CMeans is NP-hard and has no FPTAS for $k \ge 2$ in high dimensional space
unless P=NP (see Appendix for the proof).

The above theorem indicates that the fullness of k-CMeans does not reduce the hardness of the
problem. However, this does not necessarily mean that full k-CMeans is as difficult as general
k-CMeans when the goal is a $(1+\epsilon)$-approximation for fixed k. Below we show that a $(1+\epsilon)$-approximation
can be achieved relatively easily for full k-CMeans through some random sampling technique.



First we introduce a key lemma from [13]. Let S be a set of n points in $\mathbb{R}^d$ space, T be a randomly
selected subset of S with t points, and x(S), x(T) be the mean points of S and T respectively.

Lemma 1 ([13]). With probability $1 - \eta$, $\|x(S) - x(T)\|^2 < \frac{1}{\eta t} Var^0(S)$, where
$Var^0(S) = (\sum_{s \in S} \|s - x(S)\|^2)/n$.

Lemma 2. Let S be a set of elements, and S' be a subset of S such that $\frac{|S'|}{|S|} = \alpha$. If we randomly select
$\frac{t \ln\frac{t}{\eta}}{\ln(1+\alpha)} = O(\frac{t}{\alpha}\ln\frac{t}{\eta})$ elements from S, then with probability at least $1 - \eta$ the sample
contains at least t elements from S'.

Proof. If we randomly select z elements from S, then it is easy to see that with probability
$1 - (1-\alpha)^z$, at least one element of the sample belongs to S'. If we want the probability
$1 - (1-\alpha)^z$ to equal $1 - \eta/t$, z has to be

$$z = \frac{\ln\frac{\eta}{t}}{\ln(1-\alpha)} = \frac{\ln\frac{t}{\eta}}{\ln(1+\frac{\alpha}{1-\alpha})} \le \frac{\ln\frac{t}{\eta}}{\ln(1+\alpha)} = O(\frac{1}{\alpha}\ln\frac{t}{\eta})$$

(by Taylor series and $\alpha < 1$, $\ln(1+\alpha) = \Theta(\alpha)$). Thus if we perform t rounds of random sampling with each round selecting
$O(\frac{1}{\alpha}\ln\frac{t}{\eta})$ elements, we get at least t elements from S' with probability at least $(1 - \eta/t)^t \ge 1 - \eta$. □
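As a numerical illustration of Lemma 2, the sketch below computes the per-round and total sample sizes from the proof and the probability that a single round misses S' entirely. The function names are hypothetical, and rounding the per-round size up to an integer is an assumption made so that the sizes are usable as counts.

```python
import math


def per_round_size(alpha, eta, t):
    """Per-round sample size z = ceil(ln(t/eta) / ln(1 + alpha)) from the proof."""
    return math.ceil(math.log(t / eta) / math.log(1.0 + alpha))


def total_sample_size(alpha, eta, t):
    """Total number of sampled elements over t rounds."""
    return t * per_round_size(alpha, eta, t)


def round_failure_probability(alpha, z):
    """Probability that z independent draws all miss a subset of fraction alpha."""
    return (1.0 - alpha) ** z
```

For example, with $\alpha = 1/4$, $\eta = 0.1$ and $t = 5$, `per_round_size` gives a z for which `round_failure_probability(alpha, z)` is at most $\eta/t$, so all t rounds succeed with probability at least $1 - \eta$.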

Lemma 1 tells us that if we want to find an approximate mean point whose squared distance to the
mean point is within $\epsilon Var^0(S)$, we just need to take a random sample of size $O(1/\epsilon)$. Lemma 2 suggests that for
any set S and its subset $S' \subseteq S$ of size $\alpha|S|$, we can obtain a random subset T of S' with size $O(1/\epsilon)$ by
randomly sampling $O(\frac{1}{\epsilon\alpha}\ln\frac{1}{\epsilon})$ points directly from S, even if S' is an unknown subset of S. Combining
the two lemmas, we can immediately compute an approximation solution for full k-CMeans in the
following way. First, we note that in full k-CMeans, each optimal cluster contains exactly n points from
the total of kn points in Q. This means that each cluster contains a fraction $\frac{1}{k}$ of the points of Q. Then, we
can obtain an approximate mean point for each optimal cluster by (1) randomly sampling $O(\frac{k}{\epsilon}\ln\frac{1}{\epsilon})$
points from Q, (2) enumerating all subsets of the sample of size $O(1/\epsilon)$ to find the set T which is a random
sample of the unknown optimal cluster, and (3) computing the mean of T as the approximate mean
point of the optimal cluster. Finally, we can generate the k chromatic clusters from the k approximate
mean points by using the bipartite matching procedure (see Section 2).

Theorem 3. With constant probability, a $(1+\epsilon)$-approximation of full k-CMeans can be obtained in
$O(2^{poly(\frac{k}{\epsilon})} nd)$ time.

With the above theorem, we only need to focus on the general k-CMeans problem in the remaining
sections. Note that in the general case, some clusters may contain a very small fraction (rather than
1/k) of the points, and thus we cannot use the above method to solve the general k-CMeans problem.
4 Constant Approximation from k-means

In this section, we show that a constant approximation solution for k-CMeans can be produced from
an approximation solution of k-means. Below is the main theorem of this section.

Theorem 4. Let $Q = \{G_1, \dots, G_n\}$ be an instance of k-CMeans, and C be the k mean points of a
constant c-approximation solution of k-means on the points $\bigcup_{i=1}^{n} G_i$. Then $[C]^k$ contains at least one
k-tuple which could induce a $(2ck^2 + 2k - 1)$-approximation of k-CMeans on Q, where
$[C]^k = C \times \dots \times C$ (k times).

To prove Theorem 4, we first introduce two lemmas.
Lemma 3. Let P be a set of points in $\mathbb{R}^d$ space, and m be the mean point of P. For any point
$m' \in \mathbb{R}^d$, $\sum_{p \in P} \|p - m'\|^2 = \sum_{p \in P} \|p - m\|^2 + |P| \times \|m - m'\|^2$ (see Appendix for the proof).
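The identity of Lemma 3 is easy to check numerically. The following sketch (hypothetical helper names, points represented as tuples) computes both sides for an arbitrary point $m'$ so they can be compared.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lemma3_sides(points, m_prime):
    """Return (lhs, rhs) of Lemma 3 for the point set `points` and the point m'."""
    m = mean(points)
    lhs = sum(sq_dist(p, m_prime) for p in points)
    rhs = sum(sq_dist(p, m) for p in points) + len(points) * sq_dist(m, m_prime)
    return lhs, rhs
```

For P = {(0,0), (2,0), (1,3)} and m' = (5,-1), the mean is (1,1) and both sides evaluate to 68.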
Lemma 4. Let P be a set of points in $\mathbb{R}^d$ space, and $P_1$ be its subset containing $\alpha|P|$ points for some
$0 < \alpha \le 1$. Let m and $m_1$ be the mean points of P and $P_1$ respectively. Then
$\|m_1 - m\| \le \sqrt{\frac{1-\alpha}{\alpha}}\,\delta$, where $\delta^2 = \frac{1}{|P|}\sum_{p \in P} \|p - m\|^2$.

Proof. Let $P_2 = P \setminus P_1$, and $m_2$ be its mean point. By Lemma 3, we first have the following two
equalities.

$$\sum_{p \in P_1} \|p - m\|^2 = \sum_{p \in P_1} \|p - m_1\|^2 + |P_1| \times \|m_1 - m\|^2. \quad (1)$$

$$\sum_{p \in P_2} \|p - m\|^2 = \sum_{p \in P_2} \|p - m_2\|^2 + |P_2| \times \|m_2 - m\|^2. \quad (2)$$

Then by the definition of $\delta$, we have $\delta^2 = \frac{1}{|P|}(\sum_{p \in P_1} \|p - m\|^2 + \sum_{p \in P_2} \|p - m\|^2)$. Let
$L = \|m_1 - m_2\|$. By the definition of mean point, we have
$m = \frac{1}{|P|}\sum_{p \in P} p = \frac{1}{|P|}(\sum_{p \in P_1} p + \sum_{p \in P_2} p) = \frac{1}{|P|}(|P_1| m_1 + |P_2| m_2)$.
Thus the three points $\{m, m_1, m_2\}$ are collinear, and $\|m_1 - m\| = (1-\alpha)L$ and
$\|m_2 - m\| = \alpha L$. Combining (1) and (2), we have

$$\delta^2 = \frac{1}{|P|}\Big(\sum_{p \in P_1} \|p - m_1\|^2 + |P_1| \times \|m_1 - m\|^2 + \sum_{p \in P_2} \|p - m_2\|^2 + |P_2| \times \|m_2 - m\|^2\Big)$$
$$\ge \frac{1}{|P|}\big(|P_1| \times \|m_1 - m\|^2 + |P_2| \times \|m_2 - m\|^2\big) = \alpha((1-\alpha)L)^2 + (1-\alpha)(\alpha L)^2 = \alpha(1-\alpha)L^2.$$

Thus, we have $L \le \frac{\delta}{\sqrt{\alpha(1-\alpha)}}$, which means that $\|m_1 - m\| = (1-\alpha)L \le \sqrt{\frac{1-\alpha}{\alpha}}\,\delta$. □
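The bound of Lemma 4 can likewise be checked on concrete data. The sketch below (hypothetical helper names; $P_1$ is passed as an explicit sub-list of P) returns the distance $\|m_1 - m\|$ together with the bound $\sqrt{(1-\alpha)/\alpha}\,\delta$. The bound is attained when both $P_1$ and $P \setminus P_1$ are concentrated at single points, so comparisons in floating point should allow a small tolerance.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lemma4_quantities(P, P1):
    """Return (||m1 - m||, sqrt((1 - alpha) / alpha) * delta) for a subset P1 of P."""
    m, m1 = mean(P), mean(P1)
    alpha = len(P1) / len(P)
    delta = math.sqrt(sum(sq_dist(p, m) for p in P) / len(P))
    dist = math.sqrt(sq_dist(m1, m))
    bound = math.sqrt((1.0 - alpha) / alpha) * delta
    return dist, bound
```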

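Fig. 1. An example illustrating Theorem 4.

Proof (of Theorem 4). Let $\{c_1, \dots, c_k\}$ be the k mean points in C, and $\{S_1, \dots, S_k\}$ be their
corresponding clusters. Let $\{m_1, \dots, m_k\}$ be the k unknown optimal mean points of k-CMeans, and
$OPT = \{Opt_1, \dots, Opt_k\}$ be the corresponding k optimal chromatic clusters. Let $P^i_j = Opt_i \cap S_j$,
and $\tau^i_j$ be its mean point for $1 \le i, j \le k$. Since $\bigcup_{j=1}^{k} P^i_j = Opt_i$, by the pigeonhole principle we know
that there must exist some index $1 \le j_i \le k$ such that $|P^i_{j_i}| \ge \frac{1}{k}|Opt_i|$. Thus by fixing $j_i$, we have the
following about $\sum_{p \in Opt_i} \|p - c_{j_i}\|^2$ (see Figure 1):

$$\sum_{p \in Opt_i} \|p - c_{j_i}\|^2 = \sum_{p \in Opt_i} \|p - m_i\|^2 + |Opt_i| \times \|m_i - c_{j_i}\|^2$$
$$= \sum_{p \in Opt_i} \|p - m_i\|^2 + |Opt_i| \times \|m_i - \tau^i_{j_i} + \tau^i_{j_i} - c_{j_i}\|^2$$
$$\le \sum_{p \in Opt_i} \|p - m_i\|^2 + |Opt_i| \times (\|m_i - \tau^i_{j_i}\| + \|\tau^i_{j_i} - c_{j_i}\|)^2$$
$$\le \sum_{p \in Opt_i} \|p - m_i\|^2 + |Opt_i| \times 2(\|m_i - \tau^i_{j_i}\|^2 + \|\tau^i_{j_i} - c_{j_i}\|^2), \quad (3)$$

where the first equation follows from Lemma 3 (note that $m_i$ is the mean point of $Opt_i$), and the last
inequality follows from the fact that $(a+b)^2 \le 2(a^2 + b^2)$ for any numbers a and b. By Lemma 4, we
have

$$\|\tau^i_{j_i} - m_i\|^2 \le \frac{|Opt_i| - |P^i_{j_i}|}{|P^i_{j_i}|}\Big(\frac{1}{|Opt_i|}\sum_{p \in Opt_i} \|p - m_i\|^2\Big), \quad (4)$$

$$\|\tau^i_{j_i} - c_{j_i}\|^2 \le \frac{|S_{j_i}| - |P^i_{j_i}|}{|P^i_{j_i}|}\Big(\frac{1}{|S_{j_i}|}\sum_{p \in S_{j_i}} \|p - c_{j_i}\|^2\Big). \quad (5)$$

Plugging (4) and (5) into inequality (3), we have

$$\sum_{p \in Opt_i} \|p - c_{j_i}\|^2 \le \sum_{p \in Opt_i} \|p - m_i\|^2 + |Opt_i| \times 2(\|m_i - \tau^i_{j_i}\|^2 + \|\tau^i_{j_i} - c_{j_i}\|^2)$$
$$\le \sum_{p \in Opt_i} \|p - m_i\|^2 + |Opt_i| \times 2\Big(\frac{|Opt_i| - |P^i_{j_i}|}{|P^i_{j_i}|}\Big(\frac{1}{|Opt_i|}\sum_{p \in Opt_i} \|p - m_i\|^2\Big) + \frac{|S_{j_i}| - |P^i_{j_i}|}{|P^i_{j_i}|}\Big(\frac{1}{|S_{j_i}|}\sum_{p \in S_{j_i}} \|p - c_{j_i}\|^2\Big)\Big)$$
$$\le (2k-1)\sum_{p \in Opt_i} \|p - m_i\|^2 + 2\frac{|Opt_i|}{|P^i_{j_i}|} \times \Big(1 - \frac{|P^i_{j_i}|}{|S_{j_i}|}\Big)\sum_{p \in S_{j_i}} \|p - c_{j_i}\|^2.$$

Since $|P^i_{j_i}| \ge \frac{1}{k}|Opt_i|$, we have $\frac{|Opt_i|}{|P^i_{j_i}|} \times (1 - \frac{|P^i_{j_i}|}{|S_{j_i}|}) \le k$. Thus the above inequality becomes

$$\sum_{p \in Opt_i} \|p - c_{j_i}\|^2 \le (2k-1)\sum_{p \in Opt_i} \|p - m_i\|^2 + 2k\sum_{p \in S_{j_i}} \|p - c_{j_i}\|^2. \quad (6)$$

Summing both sides of (6) over i, we have

$$\sum_{i=1}^{k}\sum_{p \in Opt_i} \|p - c_{j_i}\|^2 \le (2k-1)\sum_{i=1}^{k}\sum_{p \in Opt_i} \|p - m_i\|^2 + 2k\sum_{i=1}^{k}\sum_{p \in S_{j_i}} \|p - c_{j_i}\|^2$$
$$\le (2k-1)\sum_{i=1}^{k}\sum_{p \in Opt_i} \|p - m_i\|^2 + 2k^2\sum_{j=1}^{k}\sum_{p \in S_j} \|p - c_j\|^2, \quad (7)$$

where the second inequality follows from the inequality $\sum_{p \in S_{j_i}} \|p - c_{j_i}\|^2 \le \sum_{j=1}^{k}\sum_{p \in S_j} \|p - c_j\|^2$,
which implies that $2k\sum_{i=1}^{k}\sum_{p \in S_{j_i}} \|p - c_{j_i}\|^2 \le 2k^2\sum_{j=1}^{k}\sum_{p \in S_j} \|p - c_j\|^2$.

It is obvious that the optimal objective value of k-means is no larger than that of k-CMeans on
the same set of points in Q. Thus, $\sum_{j=1}^{k}\sum_{p \in S_j} \|p - c_j\|^2 \le c\sum_{i=1}^{k}\sum_{p \in Opt_i} \|p - m_i\|^2$. Plugging this
inequality into inequality (7), we have

$$\sum_{i=1}^{k}\sum_{p \in Opt_i} \|p - c_{j_i}\|^2 \le (2ck^2 + 2k - 1)\sum_{i=1}^{k}\sum_{p \in Opt_i} \|p - m_i\|^2.$$

The above inequality means that if we take the k-tuple $(c_{j_1}, \dots, c_{j_k})$ as the k approximate mean points
for k-CMeans, we have a $(2ck^2 + 2k - 1)$-approximation solution, where the k chromatic clusters can
be obtained by the bipartite matching procedure. Thus, the theorem is proved. □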

Running Time: In the above theorem, the bipartite matching procedure takes $O(k^3 nd)$ time for
one k-tuple. Since there are in total $O(k^k)$ such k-tuples, the total running time is $O(k^{k+3} nd)$ for
computing a $(2ck^2 + 2k - 1)$-approximation of k-CMeans from a c-approximation of k-means. As k is
assumed to be a constant in this paper, the running time is linear.
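The enumeration behind Theorem 4 can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the names are hypothetical, the k-means centers C are taken as given input, and the per-point-set matching is done by brute force (suitable only for small k). Tuples in $[C]^k$ may repeat a center, which matters when two optimal chromatic clusters both draw most of their points from the same k-means cluster.

```python
from itertools import permutations, product


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def matching_cost(points, centers):
    """Minimum cost of matching one point-set injectively to the k centers."""
    return min(
        sum(sq_dist(p, centers[c]) for p, c in zip(points, perm))
        for perm in permutations(range(len(centers)), len(points))
    )


def best_center_tuple(point_sets, C):
    """Try every k-tuple in [C]^k and keep the one with the lowest chromatic cost."""
    k = len(C)
    best_cost, best_tuple = float("inf"), None
    for centers in product(C, repeat=k):
        cost = sum(matching_cost(G, centers) for G in point_sets)
        if cost < best_cost:
            best_cost, best_tuple = cost, centers
    return best_cost, best_tuple
```

With point-sets {0, 1} and {0, 1} on a line and C = {0.5, 100}, the best tuple is (0.5, 0.5): both chromatic clusters are served by the same k-means center.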

5 $(1+\epsilon)$-Approximation Algorithm

This section presents our $(1+\epsilon)$-approximation solution to the k-CMeans problem. We first introduce
a standalone result, the Simplex Lemma, and then use it to achieve a $(1+\epsilon)$-approximation for k-CMeans.
The main idea of the algorithm is to use a sphere peeling technique to generate the chromatic clusters
iteratively, where the Simplex Lemma helps to determine a proper peeling region.



5.1 Simplex Lemma 

The Simplex Lemma is mainly for approximating the mean point of some unknown point set P. The
only known information about P is a set S of j points, each of them being an approximate mean
point of a subset of P. The following Simplex Lemmas show that it is possible to construct a simplex
from S and find the desired approximate mean point of P inside the simplex.

Fig. 2. An example for Lemma 5 with j = 4. Fig. 3. An example for Lemma 6 with j = 4.

Lemma 5 (Simplex Lemma I). Let P be a set of points in $\mathbb{R}^d$ with a partition $P = \bigcup_{l=1}^{j} P_l$
and $P_{l_1} \cap P_{l_2} = \emptyset$ for any $l_1 \ne l_2$. Let o be the mean point of P, and $o_l$ be the mean point of $P_l$ for
$1 \le l \le j$. Further, let $\delta^2 = \frac{1}{|P|}\sum_{p \in P} \|p - o\|^2$, and V be the simplex determined by $\{o_1, \dots, o_j\}$.
Then for any $0 < \epsilon \le 1$, it is possible to construct a grid of size $O((8j/\epsilon)^j)$ inside V such that at least
one grid point $\tau$ satisfies the inequality $\|\tau - o\| \le \sqrt{\epsilon}\,\delta$.

Proof. We will prove this lemma by mathematical induction on j.

Base case: For j = 1, since $P_1 = P$, $o_1 = o$. Thus, the simplex V and the grid are both simply the
point $o_1$. Clearly $\tau = o_1$ satisfies the inequality.

Induction step: Assume that the lemma holds for any $j \le j_0$ for some $j_0 \ge 1$ (i.e., Induction
Hypothesis). Now we consider the case of $j = j_0 + 1$. First, we assume that $\frac{|P_l|}{|P|} \ge \frac{\epsilon}{4j}$ for each
$1 \le l \le j$. Otherwise, we can reduce the problem to the case of smaller j in the following way. Let
$I = \{l \mid 1 \le l \le j, \frac{|P_l|}{|P|} < \frac{\epsilon}{4j}\}$ be the index set of small subsets. Then,
$\sum_{l \in I}\frac{|P_l|}{|P|} < \frac{\epsilon}{4}$, and $\sum_{l \notin I}\frac{|P_l|}{|P|} \ge 1 - \frac{\epsilon}{4}$.
By Lemma 4, we know that $\|o' - o\| \le \sqrt{\frac{\epsilon/4}{1 - \epsilon/4}}\,\delta$, where $o'$ is the mean point of $\bigcup_{l \notin I} P_l$. Let $(\delta')^2$ be
the variance of $\bigcup_{l \notin I} P_l$. Then, we have $(\delta')^2 \le \frac{|P|}{|\bigcup_{l \notin I} P_l|}\delta^2 \le \frac{1}{1 - \epsilon/4}\delta^2$. Thus, if we replace P and $\epsilon$ by
$\bigcup_{l \notin I} P_l$ and $\frac{\epsilon}{16}$ respectively, and find a point $\tau$ such that
$\|\tau - o'\|^2 \le \frac{\epsilon}{16}(\delta')^2 \le \frac{\epsilon/16}{1 - \epsilon/4}\delta^2$, we have

$$\|\tau - o\|^2 \le (\|\tau - o'\| + \|o' - o\|)^2 \le \frac{\frac{9}{16}\epsilon}{1 - \epsilon/4}\delta^2 \le \epsilon\delta^2$$

(where the last inequality is due to the fact that $\epsilon \le 1$).

This means that we can reduce the problem to a problem with point set $\bigcup_{l \notin I} P_l$ and a smaller j (i.e.,
$j - |I|$). By the induction hypothesis, we know that the reduced problem can be solved (note that the
simplex would be a subset of V determined by $\{o_l \mid 1 \le l \le j, l \notin I\}$), and therefore the induction step
holds for this case. Thus, in the following discussion, we can assume that $\frac{|P_l|}{|P|} \ge \frac{\epsilon}{4j}$ for each $1 \le l \le j$.

For each $1 \le l \le j$, since $\frac{|P_l|}{|P|} \ge \frac{\epsilon}{4j}$, by Lemma 4 we know that
$\|o_l - o\| \le \sqrt{\frac{1 - \frac{\epsilon}{4j}}{\frac{\epsilon}{4j}}}\,\delta \le 2\sqrt{\frac{j}{\epsilon}}\,\delta$.
This, together with the triangle inequality, implies that for any $1 \le l, l' \le j$,
$\|o_l - o_{l'}\| \le \|o_l - o\| + \|o_{l'} - o\| \le 4\sqrt{\frac{j}{\epsilon}}\,\delta$. Thus, if we pick any index $l_0$, and draw a ball B centered at $o_{l_0}$ and with
radius $r = \max_{1 \le l \le j}\{\|o_l - o_{l_0}\|\} \le 4\sqrt{\frac{j}{\epsilon}}\,\delta$, the whole simplex V will be inside B. Note that since
$o = \sum_{l=1}^{j}\frac{|P_l|}{|P|}o_l$, o also lies inside V. This indicates that we can construct B in the (j-1)-dimensional
space spanned by $\{o_1, \dots, o_j\}$, rather than the whole $\mathbb{R}^d$ space. Also, if we build a grid inside B with
grid length $\frac{\epsilon r}{4j}$, the total number of grid points is no more than $O((\frac{8j}{\epsilon})^j)$. With this grid, we know
that for any point q inside V, there exists a grid point g such that
$\|g - q\| \le \sqrt{j(\frac{\epsilon r}{4j})^2} = \frac{\epsilon}{4\sqrt{j}}r \le \sqrt{\epsilon}\,\delta$.
This means that we can find a grid point $\tau$ inside V such that $\|\tau - o\|^2 \le \epsilon\delta^2$. Thus, the induction step
holds.

With the above base case and induction step, the lemma holds for any $j \ge 1$. □
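The fact underlying the Simplex Lemma, namely that $o = \sum_{l}\frac{|P_l|}{|P|}o_l$ lies inside the simplex spanned by the part means, can be illustrated with a small sketch. Note the substitution: the proof builds a Cartesian grid inside a ball in the (j-1)-dimensional span, whereas this sketch uses a lattice of barycentric weights $a_l/\text{steps}$, which is a different and simpler grid chosen only for illustration. All function names and the `steps` parameter are hypothetical.

```python
import math
from itertools import product


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def barycentric_grid(vertices, steps):
    """Points sum_l (a_l / steps) * o_l, with non-negative integers a_l summing to steps."""
    j, d = len(vertices), len(vertices[0])
    grid = []
    for a in product(range(steps + 1), repeat=j - 1):
        if sum(a) > steps:
            continue
        w = list(a) + [steps - sum(a)]  # the last weight is determined by the others
        grid.append(tuple(
            sum(w[l] * vertices[l][c] for l in range(j)) / steps for c in range(d)
        ))
    return grid


def nearest_grid_distance(parts, steps):
    """Distance from the overall mean o to the closest grid point of the simplex
    spanned by the means o_l of the parts P_l."""
    o = mean([p for part in parts for p in part])
    vertices = [mean(part) for part in parts]
    return min(math.dist(g, o) for g in barycentric_grid(vertices, steps))
```

When `steps` is a multiple of $|P|$, the exact weights $|P_l|/|P|$ are grid weights, so the mean itself is a grid point; coarser grids give a nonzero error that shrinks as the grid is refined.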
In the above lemma, we assume that the exact positions of $\{o_1, \dots, o_j\}$ are known (see Fig. 2).
However, in some scenarios (e.g., when the exact partition of P is not given, as is the case in k-CMeans),
it is possible that we only know the approximate position of each mean point $o_l$ (see Fig. 3). The
following lemma shows that an approximate position of o can still be determined in a similar way.

Lemma 6 (Simplex Lemma II). Let P, o, $P_l$, $o_l$, $1 \le l \le j$, and $\delta$ be defined as in Lemma 5. Let
$\{o'_1, \dots, o'_j\}$ be j points in $\mathbb{R}^d$ such that $\|o'_l - o_l\| \le L$ for $1 \le l \le j$ and $L > 0$, and V' be the simplex
determined by $\{o'_1, \dots, o'_j\}$. Then for any $0 < \epsilon \le 1$, it is possible to construct a grid of size $O((8j/\epsilon)^j)$
inside V' such that at least one grid point $\tau$ satisfies the inequality $\|\tau - o\| \le \sqrt{\epsilon}\,\delta + (1+\epsilon)L$.

5.2 Sphere Peeling Algorithm 

This section presents a sphere peeling algorithm to achieve a $(1+\epsilon)$-approximation for k-CMeans.

Let $Q = \{G_1, \dots, G_n\}$ be an instance of k-CMeans with k (unknown) optimal chromatic clusters
$OPT = \{Opt_1, \dots, Opt_k\}$, and $m_j$ be the mean point of the cluster $Opt_j$ for $1 \le j \le k$. Without loss
of generality, we assume that $|Opt_1| \ge |Opt_2| \ge \dots \ge |Opt_k|$.

Algorithm overview: Our algorithm first computes a constant C-approximation solution (by
Theorem 4) to determine an upper bound $\Delta$ of the optimal objective value $\delta^2_{opt}$, and then searches for a
good approximation of $\delta^2_{opt}$ in the interval $[\Delta/C, \Delta]$. At each search step, our algorithm performs a
sphere peeling procedure to iteratively generate k approximate mean points for the chromatic clusters.
Initially, the sphere peeling procedure uses random sampling (i.e., Lemmas 1 and 2) to find
an approximate mean point for $Opt_1$. At the (j+1)-th iteration, it already has approximate mean points
$\{p_{v_1}, \dots, p_{v_j}\}$ for $Opt_1, \dots, Opt_j$ respectively. Then it draws j peeling spheres, $B_{j+1,1}, \dots, B_{j+1,j}$,
centered at the j approximate mean points respectively and with a radius determined by the
approximation of $\delta_{opt}$. Denote the set of unknown points $Opt_{j+1} \setminus (\bigcup_{l=1}^{j} B_{j+1,l})$ as $\mathcal{A}$. Our algorithm considers
two cases: (a) $|\mathcal{A}|$ is large enough and (b) $|\mathcal{A}|$ is small. For case (a), since $|\mathcal{A}|$ is large enough, we
can first use Lemma 2 to find an approximate mean point $m_{\mathcal{A}}$ of $\mathcal{A}$, and then construct a simplex
determined by $m_{\mathcal{A}}$ and $\{p_{v_1}, \dots, p_{v_j}\}$. For case (b), it directly constructs a simplex determined just
by $\{p_{v_1}, \dots, p_{v_j}\}$. In either case, our algorithm builds a grid inside the simplex (i.e., using Lemma 6)
to find an approximate mean point for $Opt_{j+1}$ (i.e., $p_{v_{j+1}}$). The sphere peeling procedure is repeated k
times to generate the k approximate mean points.

Algorithm k-CMeans

Input: $Q = \{G_1, \dots, G_n\}$, $k \ge 2$, and a small positive value $\epsilon$.
Output: A $(1+\epsilon)$-approximation solution for k-CMeans on Q.

1. Run the PTAS of k-means in [15] on Q, and let $\Delta$ be the obtained objective value.

2. For each value $\delta_i$ in an $\epsilon$-net of the interval $[\frac{\sqrt{\Delta}}{2k}, \sqrt{\Delta}]$, do

(a) Set $\delta = \delta_i$, and run the Sphere-Peeling-Tree algorithm.

(b) Let $T_i$ be the output tree.

3. For each path of every $T_i$, use the bipartite matching procedure to compute the objective value of
k-CMeans on Q. Output the k points from the path with the smallest objective value.

Algorithm Sphere-Peeling-Tree

Input: Q, $k \ge 2$, $\epsilon$, $\delta > 0$.
Output: A tree T of height k with each node v associated with a point $p_v \in \mathbb{R}^d$.

1. Initialize T with a single root node v associated with no point.

2. Recursively grow each node v in the following way:

(a) If the height of v is already k, then it is a leaf.

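(b) Otherwise, let j be the height of v. Build the radius candidate set
$\mathcal{R} = \bigcup_{t=0}^{\log(kn)}\{\frac{1 + l\frac{\epsilon}{2}}{2(1+\epsilon)}\,j\,2^{t/2}\sqrt{\epsilon}\,\delta \mid 0 \le l \le 4 + \frac{2}{\epsilon}\}$. For each $r \in \mathcal{R}$, do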

i. Let $\{p_{v_1}, \dots, p_{v_j}\}$ be the j points associated with the nodes on the root-to-v path.

ii. For each $p_{v_l}$, $1 \le l \le j$, construct a ball $B_{j+1,l}$ centered at $p_{v_l}$ and with radius r.
iii. Take a random sample from $Q \setminus \bigcup_{l=1}^{j} B_{j+1,l}$ with size $m = \frac{8k^3}{\epsilon^9}\ln\frac{k^2}{\epsilon^6}$. Compute the mean
points of all subsets of the sample, and denote them as $\Pi = \{\pi_1, \dots, \pi_{2^m - 1}\}$.

iv. For each $\pi_i \in \Pi$, construct the simplex determined by $\{p_{v_1}, \dots, p_{v_j}, \pi_i\}$. Also construct the
simplex determined by $\{p_{v_1}, \dots, p_{v_j}\}$. Build a grid inside each simplex with size $O((\frac{8j}{\epsilon})^j)$.

v. In total, there are $2^m$ simplices, each with its own grid. For each grid point, add
one child to v, and associate it with the grid point.

Theorem 5. With constant probability, Algorithm k-CMeans yields a $(1+\epsilon)$-approximation for
k-CMeans in $O(2^{poly(\frac{k}{\epsilon})} n(\log n)^{k+1} d)$ time.

5.3 Proof of Theorem 5

Let $\beta_j = |Opt_j| / |\bigcup_{i=1}^{n} G_i|$, and $\delta_j^2 = \frac{1}{|Opt_j|}\sum_{p \in Opt_j} \|p - m_j\|^2$, where $m_j$ is the mean point of $Opt_j$.
Clearly, $\beta_1 \ge \dots \ge \beta_k$ (by assumption) and $\sum_{j=1}^{k}\beta_j = 1$. Let $\delta^2_{opt} = \sum_{j=1}^{k}\beta_j\delta_j^2$.

We prove Theorem 5 by mathematical induction. Instead of proving it directly, we consider the
following two lemmas, which jointly ensure the correctness of Theorem 5.

Lemma 7. Among all the trees generated in Algorithm k-CMeans, with constant probability, there
exists at least one tree, $T_i$, which has a root-to-leaf path with each node $v_j$ at level j, $1 \le j \le k$, on
the path associated with a point $p_{v_j}$ satisfying the inequality
$\|p_{v_j} - m_j\| \le \epsilon\delta_j + (1+\epsilon)j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$.

Before proving this lemma, we first show its implication.

Lemma 8. If Lemma 7 is true, Algorithm k-CMeans yields a $(1 + O(k^3)\epsilon)$-approximation for
k-CMeans.

Proof. We first assume that Lemma 7 is true. Below, $|Q|$ denotes the total number of points
$|\bigcup_{i=1}^{n} G_i|$. Then for each $1 \le j \le k$, we have

$$\sum_{p \in Opt_j} \|p - p_{v_j}\|^2 = \sum_{p \in Opt_j} \|p - m_j\|^2 + |Opt_j| \times \|m_j - p_{v_j}\|^2$$
$$\le \sum_{p \in Opt_j} \|p - m_j\|^2 + |Opt_j| \times 2\Big(\epsilon^2\delta_j^2 + (1+\epsilon)^2 j^2\frac{\epsilon}{\beta_j}\delta^2_{opt}\Big)$$
$$= (1 + 2\epsilon^2)|Opt_j|\delta_j^2 + 2(1+\epsilon)^2 j^2\epsilon|Q|\delta^2_{opt}, \quad (8)$$

where the first equation follows from Lemma 3 (note that $m_j$ is the mean point of $Opt_j$), the second
inequality follows from Lemma 7 and the fact that $(a+b)^2 \le 2(a^2 + b^2)$ for any two real numbers a
and b, and the last equality follows from $\frac{|Opt_j|}{\beta_j} = |Q|$. Summing both sides of (8) over j, we have

$$\sum_{j=1}^{k}\sum_{p \in Opt_j} \|p - p_{v_j}\|^2 \le \sum_{j=1}^{k}\Big((1 + 2\epsilon^2)|Opt_j|\delta_j^2 + 2(1+\epsilon)^2 j^2\epsilon|Q|\delta^2_{opt}\Big)$$
$$\le (1 + 2\epsilon^2)\sum_{j=1}^{k}|Opt_j|\delta_j^2 + 2(1+\epsilon)^2 k^3\epsilon|Q|\delta^2_{opt} = (1 + O(k^3)\epsilon)|Q|\delta^2_{opt}, \quad (9)$$

where the last equation follows from the fact that $\sum_{j=1}^{k}|Opt_j|\delta_j^2 = |Q|\delta^2_{opt}$. By (9), we know that
$\{p_{v_1}, \dots, p_{v_k}\}$ will induce a $(1 + O(k^3)\epsilon)$-approximation solution for k-CMeans via the bipartite matching
procedure. Since Algorithm k-CMeans outputs the best solution generated in all trees, the resulting
solution is clearly a $(1 + O(k^3)\epsilon)$-approximation solution. Thus the lemma is true. □

The above lemma indicates that if we replace $\epsilon$ by $\frac{\epsilon}{k^3}$ (up to a constant factor) in the input of our
algorithm, it will result in a $(1+\epsilon)$-approximation solution. This implies that Lemma 7 is indeed sufficient to ensure the
correctness of Theorem 5 (except for the time complexity). Now we prove Lemma 7.

Proof (of Lemma 7). Note that $\Delta \le 4k^2\delta^2_{opt}$, and we build an $\epsilon$-net in $[\frac{\sqrt{\Delta}}{2k}, \sqrt{\Delta}]$. Let $T_i$ be the tree
generated by Algorithm Sphere-Peeling-Tree and corresponding to the input $\delta \in [\delta_{opt}, (1+\epsilon)\delta_{opt}]$. We
will focus our discussion on $T_i$, and prove the lemma by mathematical induction on j.

Base case: For j = 1, since $\beta_1 = \max\{\beta_j \mid 1 \le j \le k\}$, we have $\beta_1 \ge \frac{1}{k}$. By Lemmas 1 and 2, we can
find the approximate mean point through random sampling. Let $p_{v_1}$ be the approximate mean
point. Clearly, $\|p_{v_1} - m_1\| \le \epsilon\delta_1 \le \epsilon\delta_1 + (1+\epsilon)\sqrt{\frac{\epsilon}{\beta_1}}\,\delta_{opt}$ (by Lemmas 1 and 2).

Induction step: We assume that there is a path in $T_i$ from the root to the $j_0$-th level, such that for
each $1 \le l \le j_0$, the level-l node $v_l$ on the path is associated with a point $p_{v_l}$ satisfying the inequality
$\|p_{v_l} - m_l\| \le \epsilon\delta_l + (1+\epsilon)l\sqrt{\frac{\epsilon}{\beta_l}}\,\delta_{opt}$ (i.e., Induction Hypothesis). Now we consider the case of $j = j_0 + 1$.
Below we will show that there is one child of $v_{j-1}$, i.e., $v_j$, such that its associated point $p_{v_j}$ satisfies
the inequality $\|p_{v_j} - m_j\| \le \epsilon\delta_j + (1+\epsilon)j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$. First, we have the following claim (see Appendix
for the proof).

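Claim (1). In the set of radius candidates built in Algorithm Sphere-Peeling-Tree, there exists one
value $r_j \in \mathcal{R}$ such that $j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} \le r_j \le (1 + \frac{\epsilon}{2})j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$.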



Now, we construct the j-1 peeling spheres, $\{B_{j,1}, \dots, B_{j,j-1}\}$ (as in Algorithm Sphere-Peeling-Tree).
For each $1 \le l \le j-1$, $B_{j,l}$ is centered at $p_{v_l}$ and has radius $r_j$. By the Markov inequality and the
induction hypothesis, we have the following claim (see Appendix for the proof).

Claim (2). For each $1 \le l \le j-1$, we have $|Opt_l \setminus (\bigcup_{w=1}^{j-1} B_{j,w})| \le \frac{4\beta_j}{\epsilon}|\bigcup_{i=1}^{n} G_i|$.

Claim 2 shows that $|Opt_l \setminus (\bigcup_{w=1}^{j-1} B_{j,w})|$ is bounded for $1 \le l \le j-1$, which helps us to find
the approximate mean point of $Opt_j$. Induced by the j-1 peeling spheres $\{B_{j,1}, \dots, B_{j,j-1}\}$, $Opt_j$
is divided into j subsets, $Opt_j \cap B_{j,1}, \dots, Opt_j \cap B_{j,j-1}$ and $Opt_j \setminus (\bigcup_{w=1}^{j-1} B_{j,w})$. To simplify our
discussion, we let $P_l$ denote $Opt_j \cap B_{j,l}$ for $1 \le l \le j-1$, $P_j$ denote $Opt_j \setminus (\bigcup_{w=1}^{j-1} B_{j,w})$, and $\tau_l$
denote the mean point of $P_l$. Note that the peeling spheres may intersect each other. For any
two intersecting spheres $B_{j,l_1}$ and $B_{j,l_2}$, we let the point set $Opt_j \cap (B_{j,l_1} \cap B_{j,l_2})$ belong to either $P_{l_1}$
or $P_{l_2}$ arbitrarily. Thus, we can assume that $\{P_l \mid 1 \le l \le j\}$ are pairwise disjoint. Now consider the
size of $P_j$ (i.e., $|P_j|$). We have the following two cases: (a) $|P_j| \ge \epsilon^3\frac{\beta_j}{j}|\bigcup_{i=1}^{n} G_i|$ and
(b) $|P_j| < \epsilon^3\frac{\beta_j}{j}|\bigcup_{i=1}^{n} G_i|$. In
the following, we show how, in each case, Algorithm Sphere-Peeling-Tree can obtain an approximate
mean point for $Opt_j$ by using the Simplex Lemma (i.e., Lemma 6).



Fig. 4. Case (a) for j = 4.    Fig. 5. Case (b) for j = 4.

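For case (a), by Claim 2, together with the fact that $\beta_l \le \beta_j$ for $l > j$, we know that

$$\frac{|P_j|}{\sum_{1 \le l \le k} |Opt_l \setminus (\cup_{w=1}^{j-1} B_{j,w})|} \ge \frac{\epsilon^3\frac{\beta_j}{j}}{\frac{4(j-1)\beta_j}{\epsilon} + \beta_j + (k-j)\beta_j} \ge \frac{\epsilon^4}{8kj} \ge \frac{\epsilon^4}{8k^2}.$$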



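This means that $P_j$ is large enough, compared to the set of points outside the peeling spheres. Hence,
we can use the random sampling technique to obtain an approximate mean point $\pi$ for $P_j$ in the following
way. First, we set $t = \frac{k}{\epsilon^5}$, $\eta = \frac{\epsilon}{k}$, and take a sample of size $\frac{t\ln(kt/\epsilon)}{\epsilon^4/(8k^2)} = \frac{8k^3}{\epsilon^9}\ln\frac{k^2}{\epsilon^6}$. By Lemma 2, we know
that with probability $1 - \frac{\epsilon}{k}$, the sample contains $\frac{k}{\epsilon^5}$ points from $P_j$. Then we let $\pi$ be the mean point
of the points from $P_j$, and $\sigma^2$ be the variance of $P_j$. By Lemma 1, we know that with probability
$1 - \frac{\epsilon}{k}$, $\|\pi - \tau_j\|^2 \le \epsilon^4\sigma^2$. Also, since $\frac{|P_j|}{|Opt_j|} \ge \frac{\epsilon^3}{j}$, we have $\sigma^2 \le \frac{|Opt_j|}{|P_j|}\delta_j^2 \le \frac{j}{\epsilon^3}\delta_j^2$. Thus, $\|\pi - \tau_j\|^2 \le \epsilon j\delta_j^2$.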

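Once obtaining $\pi$, we can now use Lemma 6 to find a point $p_{v_j}$ satisfying the condition of
$\|p_{v_j} - m_j\| \le \epsilon\delta_j + (1+\epsilon)\,j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$. First, we construct a simplex $\mathcal{V}_{(a)}$ determined by $\{p_{v_1}, \dots, p_{v_{j-1}}\}$ and
$\pi$ (see Figure 4). Note that $Opt_j$ is divided by the peeling spheres into $j$ disjoint subsets, $P_1, \dots, P_j$,
which is a partition of $Opt_j$. Each $P_l$ ($1 \le l \le j-1$) is located inside $B_{j,l}$, which implies that $\tau_l$ is also
inside $B_{j,l}$. Further, since $\|p_{v_l} - \tau_l\| \le r_j \le (1+\frac{\epsilon}{2})\,j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$ for $1 \le l \le j-1$ (by Claim 1), and
$\|\pi - \tau_j\| \le \sqrt{\epsilon j}\,\delta_j \le \sqrt{\frac{\epsilon j}{\beta_j}}\,\delta_{opt}$ (by $\beta_j\delta_j^2 \le \delta_{opt}^2$, which implies $\delta_j \le \sqrt{\frac{1}{\beta_j}}\,\delta_{opt}$), after setting the value
of $L$ (in Lemma 6) to be $\max\{r_j, \|\pi - \tau_j\|\} \le \max\{(1+\frac{\epsilon}{2})\,j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}, \sqrt{\frac{\epsilon j}{\beta_j}}\,\delta_{opt}\} \le (1+\frac{\epsilon}{2})\,j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$ and
the value of $\epsilon$ (in Lemma 6) to be $\epsilon_0 = \epsilon^2/4$, by Lemma 6 we can construct a grid inside the simplex
$\mathcal{V}_{(a)}$ with size $O((\frac{8j}{\epsilon_0})^j)$, which ensures the existence of one grid point $\tau$ satisfying the inequality
$\|\tau - m_j\| \le \sqrt{\epsilon_0}\,\delta_j + (1+\epsilon_0)L \le \epsilon\delta_j + (1+\epsilon)\,j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$. Hence, we can use $\tau$ as $p_{v_j}$, and the induction
step holds for this case.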
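The last inequality in case (a) can be checked directly. The short derivation below is a worked sketch, assuming $0 < \epsilon \le 1$, $\epsilon_0 = \epsilon^2/4$, and the bound $L \le (1+\frac{\epsilon}{2})\,j\sqrt{\epsilon/\beta_j}\,\delta_{opt}$ stated above; it only expands the two factors and is not taken verbatim from the original text.

```latex
\begin{aligned}
\sqrt{\epsilon_0}\,\delta_j &= \tfrac{\epsilon}{2}\,\delta_j \le \epsilon\,\delta_j,\\
(1+\epsilon_0)\left(1+\tfrac{\epsilon}{2}\right)
  &= 1 + \tfrac{\epsilon}{2} + \tfrac{\epsilon^2}{4} + \tfrac{\epsilon^3}{8}
  \le 1 + \tfrac{\epsilon}{2} + \tfrac{\epsilon}{4} + \tfrac{\epsilon}{8}
  < 1 + \epsilon,\\
\text{hence}\quad
\sqrt{\epsilon_0}\,\delta_j + (1+\epsilon_0)L
  &\le \epsilon\,\delta_j + (1+\epsilon)\, j\sqrt{\tfrac{\epsilon}{\beta_j}}\;\delta_{opt}.
\end{aligned}
```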

For case (b), since $P_j$ has a small size, we cannot directly perform random sampling on it to find
its approximate mean point. To overcome this difficulty, we merge $P_j$ with some other large subset
$P_l$. Particularly, since $\sum_{w=1}^{j-1}|P_w| = |Opt_j| - |P_j| \ge (\beta_j - \epsilon^3\frac{\beta_j}{j})|\mathcal{G}|$, by the pigeonhole principle, we know
that there exists one $l_0$ such that $P_{l_0}$ has size at least $\frac{1}{j-1}(\beta_j - \epsilon^3\frac{\beta_j}{j})|\mathcal{G}|$. Without loss of generality,
we assume $l_0 = 1$. Then $|P_1| \ge \frac{1}{j-1}(\beta_j - \epsilon^3\frac{\beta_j}{j})|\mathcal{G}|$, and we can view $P_1 \cup P_j$ as one large enough subset
of $Opt_j$. Let $\tau'$ denote the mean point of $P_1 \cup P_j$; then we have the following claim (see Appendix for
the proof).



Claim (3). Hn -r'|| < ^^5 opt . 

This means that we can also use Lemma [6] to find an approximate mean point in a way similar 
to case (a) (see Figure. [5]); the difference is that Optj is divided into j — 1 subsets (i.e., Pi and Pj 

is viewed as one subset Pi U Pj) and the value of L is set to be rj + ||ti — r'|| < Tj + jz^s ^J~^-$opt- 

We can first construct a simplex determined by {p Vl , ■ ■ ■ (see Figure. |5j), and then build a 

grid inside V/ b % with size 0((|^) JI ), where eo = e 2 /4. By Lemma 6, we know that there exists one grid 

point $\tau$ satisfying the condition of $\|\tau - m_j\| \le \sqrt{\epsilon_0}\,\delta_j + (1+\epsilon_0)L \le \epsilon\delta_j + (1+\epsilon)\,j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$. Thus the

induction step holds for this case.

Since Algorithm Sphere-Peeling- Tree executes every step in our above discussion, the induction 
step, as well as the lemma, is true. □ 

Success probability: From the above analysis, we know that in the $j$-th step/iteration, only case (a)
(i.e., $|P_j| \ge \epsilon^3\frac{\beta_j}{j}|\mathcal{G}|$) needs to consider success probability, since case (b) (i.e., $|P_j| < \epsilon^3\frac{\beta_j}{j}|\mathcal{G}|$) does not
need to do sampling. Recall that in case (a), we take a sample of size $\frac{8k^3}{\epsilon^9}\ln\frac{k^2}{\epsilon^6}$. Thus with probability
$1 - \frac{\epsilon}{k}$, it contains $\frac{k}{\epsilon^5}$ points from $P_j$. Meanwhile, with probability $1 - \frac{\epsilon}{k}$, $\|\pi - \tau_j\|^2 \le \epsilon^4\sigma^2$. Hence, the
success probability in the $j$-th step is $(1 - \frac{\epsilon}{k})^2$, which means that the success probability in all $k$ steps
is $(1 - \frac{\epsilon}{k})^{2k} \ge 1 - 2\epsilon$.
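The final bound is an instance of Bernoulli's inequality, $(1-x)^r \ge 1 - rx$ for $0 \le x \le 1$ and $r \ge 1$, applied with $x = \epsilon/k$ and $r = 2k$. The following minimal Python sketch checks it numerically; the function names are hypothetical, and the per-step probability $1 - \epsilon/k$ is the reading assumed here for the two sampling events.

```python
def per_step_success(eps: float, k: int) -> float:
    """Probability that both sampling events of one step succeed,
    assuming each holds with probability 1 - eps/k."""
    return (1.0 - eps / k) ** 2


def overall_success(eps: float, k: int) -> float:
    """Probability that all k steps succeed (product over the k steps)."""
    return per_step_success(eps, k) ** k


def bernoulli_lower_bound(eps: float) -> float:
    """Lower bound 1 - 2*eps obtained from Bernoulli's inequality."""
    return 1.0 - 2.0 * eps


if __name__ == "__main__":
    for k in (2, 5, 20):
        for eps in (0.01, 0.1, 0.5):
            print(k, eps, overall_success(eps, k), bernoulli_lower_bound(eps))
```

The bound is only informative for $\epsilon < 1/2$; for larger $\epsilon$ the right-hand side is non-positive.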

Running time: Algorithm fc-CMeans calls Algorithms Sphere-Peeling- Tree ^ times. It is easy to see 
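that each node on the tree returned from Algorithm Sphere-Peeling-Tree has $|\mathcal{R}|\,2^m(\frac{8j}{\epsilon_0})^j$ children,
where $|\mathcal{R}| = O(\frac{\log n}{\epsilon})$, and $m = \frac{8k^3}{\epsilon^9}\ln\frac{k^2}{\epsilon^6}$. Since the tree has a height of $k$, the complexity of the tree is
$O(2^{poly(\frac{k}{\epsilon})}(\log n)^k)$. Further, since each node takes $O(|\mathcal{R}|\,2^m(\frac{8j}{\epsilon_0})^j nd)$ time, the total time complexity
of Algorithm k-CMeans is $O(2^{poly(\frac{k}{\epsilon})}n(\log n)^{k+1}d)$.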

6 Extension to Chromatic k-Medians Clustering

We extend our ideas for k-CMeans to the Chromatic k-Medians Clustering problem (k-CMedians).
Similar to k-CMeans, we first show its relationship with k-medians, and then present a $(5+\epsilon)$-approximation
algorithm using the sphere peeling technique. Due to the lack of a similar Simplex
Lemma for k-CMedians, we achieve a constant approximation, instead of a PTAS. See details of the
algorithm in Section 14 of the Appendix.
7 Figure 6




Fig. 6. BAC probes of Chromosome 1 in a WI38 cell with homolog having 6 probes. 

8 Proof for Theorem 2

Proof. Since it is sufficient to show that the theorem holds for the case of $k = 2$, we assume in this
proof that $k = 2$ and each point-set $G_i$ has exactly two points. We make use of a construction by
Dasgupta for the NP-hardness proof of the 2-mean clustering problem in high dimensional space [8].
That proof reduces from the NAE3SAT problem. To make our ideas easier to follow, we sketch
the construction below.

1. For any instance $\phi$ of NAE3SAT with literal set $\{x_1, \dots, x_n\}$ and $m$ clauses, construct a $2n \times 2n$
matrix $D_{\alpha\beta}$ as follows, where the indices correspond to $\{x_1, \dots, x_n\}$ when they are in the range
of $[1, n]$, and to $\{\bar{x}_1, \dots, \bar{x}_n\}$ when they are in the range of $[n+1, 2n]$.

$$D_{\alpha\beta} = \begin{cases} 0 & \text{if } \alpha = \beta \\ 1 + \Delta & \text{if } \alpha = \bar{\beta} \\ 1 + \delta & \text{if } \alpha \sim \beta \\ 1 & \text{otherwise,} \end{cases}$$

where $\Delta, \delta$ are two constants satisfying the inequalities $0 < \delta < \Delta < 1$
and $4\delta m < \Delta \le 1 - 2\delta n$, and $\alpha \sim \beta$ means that both $\alpha$ and $\beta$ or both $\bar{\alpha}$ and $\bar{\beta}$ appear in a clause.

2. $D$ can be embedded into $\mathbb{R}^{2n}$, i.e., there exist $2n$ points in $\mathbb{R}^{2n}$ with $D$ as their distance matrix.

3. Let $C_1$ and $C_2$ be the two clusters of the 2-mean clustering of the $2n$ embedding points. If for any
$i$, the points corresponding to $x_i$ and $\bar{x}_i$ are separated into different clusters, then $\phi$ is satisfiable
if and only if

$$\frac{1}{2n}\sum_{i,j \in C_1} D_{ij} + \frac{1}{2n}\sum_{i,j \in C_2} D_{ij} \le n - 1 + \frac{2\delta m}{n}.$$

4. Since $\frac{1}{2n}\sum_{i,j \in C_1} D_{ij} + \frac{1}{2n}\sum_{i,j \in C_2} D_{ij}$ is the total cost of the 2-mean clustering for $C_1$ and $C_2$,
a polynomial time solution to the 2-mean clustering problem in high dimensional space implies a
polynomial time solution to NAE3SAT. Thus the 2-mean clustering is NP-hard in high dimensions.
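Step 1 of the construction can be illustrated with a small Python sketch. It assumes the reading of the matrix entries used in Dasgupta's reduction ($0$ on the diagonal, $1+\Delta$ for a literal and its negation, $1+\delta$ for related literals, $1$ otherwise); the function name `build_distance_matrix`, the 0-based literal encoding, and the parameter names are hypothetical choices made for this example.

```python
def build_distance_matrix(n, clauses, delta, big_delta):
    """Build the 2n x 2n matrix D for an NAE3SAT instance.

    Literals are encoded as integers: 0..n-1 stand for x_1..x_n and
    n..2n-1 stand for their negations (an assumed encoding).
    `clauses` is a list of 3-tuples of such integers.
    """
    def neg(a):
        # Index of the complementary literal.
        return a + n if a < n else a - n

    # alpha ~ beta: both alpha and beta, or both of their negations,
    # appear together in some clause.
    related = set()
    for clause in clauses:
        for a in clause:
            for b in clause:
                if a != b:
                    related.add((a, b))
                    related.add((neg(a), neg(b)))

    size = 2 * n
    D = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            if a == b:
                D[a][b] = 0.0
            elif b == neg(a):
                D[a][b] = 1.0 + big_delta
            elif (a, b) in related:
                D[a][b] = 1.0 + delta
            else:
                D[a][b] = 1.0
    return D


if __name__ == "__main__":
    # One clause (x_1, x_2, x_3) over n = 3 variables; the constants satisfy
    # 0 < delta < big_delta < 1 and 4*delta*m < big_delta <= 1 - 2*delta*n.
    for row in build_distance_matrix(3, [(0, 1, 2)], delta=0.01, big_delta=0.5):
        print(row)
```

The sketch only builds the matrix; the embedding into $\mathbb{R}^{2n}$ from step 2 is not reproduced here.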

The above reduction can be naturally extended to show the NP-hardness of the full chromatic
2-mean clustering problem. To show this, we only need to construct $G_i$ as the set containing the two
points corresponding to $x_i$ and $\bar{x}_i$ (for simplicity, we write it as $G_i = \{x_i, \bar{x}_i\}$), and the remaining
proof follows from the same argument.

Next, we show that full 2-CMeans has no FPTAS in high dimensional space unless P=NP. To see
this, we still use the same construction. From the above discussion, we know that $\phi$ is unsatisfiable if
and only if for any chromatic partition of $\mathcal{G}$, there exists one clause in $\phi$ such that the three points
corresponding to the three literals in this clause are clustered into the same cluster. Hence, the total
cost for any chromatic partition is at least

$$\frac{1}{n} \cdot 2\left(\binom{n}{2} + (m-1)\delta + 3\delta\right) = n - 1 + \frac{2}{n}(m+2)\delta.$$

The ratio $\eta$ between the minimum chromatic partition cost of an unsatisfiable instance and the
upper bound cost of a satisfiable instance is

$$\eta = \frac{n - 1 + \frac{2}{n}(m+2)\delta}{n - 1 + \frac{2\delta m}{n}} = 1 + \frac{\frac{4\delta}{n}}{n - 1 + \frac{2\delta m}{n}} > 1 + \frac{\frac{4\delta}{n}}{n - 1 + \frac{2(m+2)}{n}\delta}.$$

If we let $\delta = \frac{1}{5m+2n}$, then $\eta > 1 + \frac{\frac{4\delta}{n}}{n - 1 + \frac{2(m+2)}{n}\delta} = 1 + \frac{4}{n(5m+2n)(n-1) + 2(m+2)}$.

Suppose that there exists an FPTAS for the full chromatic 2-means clustering problem. Then,
if we let $\epsilon < \frac{4}{n(5m+2n)(n-1)+2(m+2)}$, the cost of a $(1+\epsilon)$-approximation of the full 2-CMeans is less
than $n - 1 + \frac{2}{n}(m+2)\delta$ if and only if $\phi$ is satisfiable. Since the running time of the FPTAS for full
2-CMeans and $\frac{1}{\epsilon}$ are all polynomial functions of $m$ and $n$, this implies that NAE3SAT can be solved
in polynomial time. Obviously this can only happen if P=NP. □

9 Proof of Lemma 3

Proof. In the following discussion, we use $\langle a, b \rangle$ to denote the inner product of $a$ and $b$. It is
easy to see that

$$\begin{aligned}
\sum_{p \in P}\|p - m'\|^2 &= \sum_{p \in P}\|p - m + m - m'\|^2 \\
&= \sum_{p \in P}\left(\|p - m\|^2 + 2\langle p - m, m - m' \rangle + \|m - m'\|^2\right) \\
&= \sum_{p \in P}\|p - m\|^2 + 2\sum_{p \in P}\langle p - m, m - m' \rangle + |P| \times \|m - m'\|^2 \\
&= \sum_{p \in P}\|p - m\|^2 + 2\Big\langle \sum_{p \in P}(p - m), m - m' \Big\rangle + |P| \times \|m - m'\|^2.
\end{aligned}$$

Since $m$ is the mean point of $P$, $\sum_{p \in P}(p - m) = 0$. Thus, the above equality becomes $\sum_{p \in P}\|p - m'\|^2 =
\sum_{p \in P}\|p - m\|^2 + |P| \times \|m - m'\|^2$. □
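The identity just proved, $\sum_{p \in P}\|p - m'\|^2 = \sum_{p \in P}\|p - m\|^2 + |P| \cdot \|m - m'\|^2$ with $m$ the mean of $P$ and $m'$ arbitrary, is easy to sanity-check numerically. The sketch below is a minimal illustration in plain Python; the helper names are hypothetical and the random points are only test data.

```python
import random


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost_to(points, center):
    """Sum of squared distances from every point to `center`."""
    return sum(sq_dist(p, center) for p in points)


def identity_gap(points, m_prime):
    """Absolute difference between the two sides of the identity."""
    m = mean_point(points)
    lhs = cost_to(points, m_prime)
    rhs = cost_to(points, m) + len(points) * sq_dist(m, m_prime)
    return abs(lhs - rhs)


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(50)]
    other = [rng.uniform(-1, 1) for _ in range(5)]
    print(identity_gap(pts, other))  # close to zero up to rounding error
```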

10 Proof of Lemma 6

Proof. Similar to Lemma 5, we prove this lemma by mathematical induction on $j$.

Base case. For $j = 1$, since $o_1 = o$, we just need to let $\tau = o'_1$. Then, we have $\|\tau - o\| = \|o'_1 - o\| =
\|o'_1 - o_1\| \le L \le \sqrt{\epsilon}\,\delta + (1+\epsilon)L$. Thus, the base case holds.

Induction step. Assume that the lemma holds for any $j \le j_0$ for some $j_0 \ge 1$ (i.e., Induction
Hypothesis). Now we consider the case of $j = j_0 + 1$. Similar to the proof of Lemma 5, we assume
that $\frac{|P_l|}{|P|} \ge \frac{\epsilon}{4j}$ for each $1 \le l \le j$. Otherwise, it can be reduced to a problem with smaller $j$, and
solved by the induction hypothesis. Hence, in the following discussion, we assume that $\frac{|P_l|}{|P|} \ge \frac{\epsilon}{4j}$ for
each $1 \le l \le j$.

First, we know that $o = \sum_{l=1}^{j}\frac{|P_l|}{|P|}o_l$. Let $o' = \sum_{l=1}^{j}\frac{|P_l|}{|P|}o'_l$. Then, we have

$$\|o - o'\| = \Big\|\sum_{l=1}^{j}\frac{|P_l|}{|P|}o_l - \sum_{l=1}^{j}\frac{|P_l|}{|P|}o'_l\Big\| \le \sum_{l=1}^{j}\frac{|P_l|}{|P|}\|o_l - o'_l\| \le L. \quad (10)$$

Thus, if we can find a grid point $\tau$ within a distance to $o'$ no more than $\sqrt{\epsilon}\,\delta + \epsilon L$ (i.e., $\|\tau - o'\| \le
\sqrt{\epsilon}\,\delta + \epsilon L$), by inequality (10), we will have $\|\tau - o\| \le \|\tau - o'\| + \|o' - o\| \le \sqrt{\epsilon}\,\delta + (1+\epsilon)L$. This
means that we only need to find a grid point close enough to $o'$.

To find such a $\tau$, we first consider the distance from $o'_l$ to $o'$. For any $1 \le l \le j$, we have

$$\|o'_l - o'\| \le \|o'_l - o_l\| + \|o_l - o\| + \|o - o'\| \le 2\sqrt{\frac{j}{\epsilon}}\,\delta + 2L, \quad (11)$$

where the first inequality follows from the triangle inequality, and the second inequality follows from the
facts that $\|o'_l - o_l\|$ and $\|o - o'\|$ are both bounded by $L$, and $\|o_l - o\| \le 2\sqrt{\frac{j}{\epsilon}}\,\delta$ (by Lemma 4).

This implies that we can use the similar idea in Lemma 5 to construct a ball $\mathcal{B}$ centered at any $o'_{l_0}$
and with radius $r = \max_{1 \le l \le j}\{\|o'_l - o'_{l_0}\|\}$. Note that since $\|o'_l - o'_{l_0}\| \le \|o'_l - o'\| + \|o' - o'_{l_0}\| \le 4\sqrt{\frac{j}{\epsilon}}\,\delta + 4L$
(by inequality (11)), the simplex $\mathcal{V}'$ is inside $\mathcal{B}$. Similar to Lemma 5, we can build a grid inside $\mathcal{B}$
with grid length $\frac{\epsilon r}{4j}$ and total grid points $O((8j/\epsilon)^j)$. Clearly in this grid, we can find a grid point $\tau$
such that $\|\tau - o'\| \le \frac{\epsilon}{4\sqrt{j}}r \le \sqrt{\epsilon}\,\delta + \epsilon L$. Thus, $\|\tau - o\| \le \sqrt{\epsilon}\,\delta + (1+\epsilon)L$, and the induction step, as
well as the lemma, holds. □

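11 Proof of Claim 1 in Lemma 7

Proof. Since $1 \ge \beta_j \ge \frac{1}{|\cup_{i=1}^{n} G_i|} \ge \frac{1}{kn}$, there is one integer $t$ between 1 and $\log(kn)$, such that $2^{t-1} \le \frac{1}{\beta_j} \le 2^t$.
Thus $2^{t/2-1}\sqrt{\epsilon}\,\delta_{opt} \le \sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} \le 2^{t/2}\sqrt{\epsilon}\,\delta_{opt}$. Together with $\delta \in [\delta_{opt}, (1+\epsilon)\delta_{opt}]$, we have

$$\frac{2^{t/2-1}}{1+\epsilon}\sqrt{\epsilon}\,\delta \le \sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} \le 2^{t/2}\sqrt{\epsilon}\,\delta.$$

Thus if we set $\tilde{r}_j = j2^{t/2}\sqrt{\epsilon}\,\delta$, we have

$$j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} \le \tilde{r}_j \le 2(1+\epsilon)\,j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}.$$

Let $x = \frac{\tilde{r}_j}{j\sqrt{\epsilon/\beta_j}\,\delta_{opt}}$. Then we have
$1 \le x \le 2(1+\epsilon)$. We build a grid in the interval $[\frac{x}{2(1+\epsilon)}, x]$ with the grid length $\frac{\epsilon}{4(1+\epsilon)}x$, and obtain a
grid set (i.e., number set) $\mathcal{N} = \{\frac{1 + l\frac{\epsilon}{2}}{2(1+\epsilon)}x \mid 0 \le l \le 4 + \frac{2}{\epsilon}\}$. We prove that there must exist one number
in $\mathcal{N}$ that is between 1 and $1 + \epsilon/2$. First, we know that $\frac{x}{2(1+\epsilon)} \le 1 \le x$. If $x \le 1 + \epsilon/2$, we find the
desired number $x$ in $\mathcal{N}$. Otherwise, the whole interval $[1, 1 + \epsilon/2]$ is inside $[\frac{x}{2(1+\epsilon)}, x]$. Since the
grid has grid length $\frac{\epsilon}{4(1+\epsilon)}x \le \frac{\epsilon}{4(1+\epsilon)} \cdot 2(1+\epsilon) = \epsilon/2$, there must exist one grid point located inside
$[1, 1 + \epsilon/2]$. Thus, the desired number exists in $\mathcal{N}$.

Let $\mathcal{R}_j = \{\frac{1 + l\frac{\epsilon}{2}}{2(1+\epsilon)}\tilde{r}_j \mid 0 \le l \le 4 + \frac{2}{\epsilon}\}$. From the above analysis, we know that there exists one
value $r_j \in \mathcal{R}_j$ such that

$$j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} \le r_j \le \left(1 + \frac{\epsilon}{2}\right) j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}.$$

Note that $\mathcal{R}_j \subset \mathcal{R}$, where $\mathcal{R} = \cup_{t=1}^{\log(kn)}\{\frac{1 + l\frac{\epsilon}{2}}{2(1+\epsilon)}\,j2^{t/2}\sqrt{\epsilon}\,\delta \mid 0 \le l \le 4 + \frac{2}{\epsilon}\}$. Thus, the Claim is proved.

□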
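The covering argument in this proof can be checked numerically. The sketch below assumes the grid is read as $\mathcal{N} = \{\frac{1 + l\epsilon/2}{2(1+\epsilon)}x \mid 0 \le l \le 4 + 2/\epsilon\}$ with $1 \le x \le 2(1+\epsilon)$; the function names are hypothetical, and the index range is rounded up so that the grid still reaches $x$ when $2/\epsilon$ is not an integer (an implementation choice, not something stated in the text).

```python
import math


def grid_candidates(x, eps):
    """Grid values (1 + l*eps/2) * x / (2*(1+eps)) for l = 0..ceil(4 + 2/eps)."""
    base = x / (2.0 * (1.0 + eps))
    max_l = math.ceil(4.0 + 2.0 / eps)
    return [base * (1.0 + l * eps / 2.0) for l in range(max_l + 1)]


def has_candidate_in_window(x, eps, tol=1e-12):
    """True if some grid value lies in [1, 1 + eps/2] (up to rounding tolerance)."""
    return any(1.0 - tol <= v <= 1.0 + eps / 2.0 + tol
               for v in grid_candidates(x, eps))


if __name__ == "__main__":
    eps = 0.25
    # Sweep x over [1, 2(1+eps)] and confirm the window is always hit.
    steps = 1000
    xs = [1.0 + i * (2.0 * (1.0 + eps) - 1.0) / steps for i in range(steps + 1)]
    print(all(has_candidate_in_window(x, eps) for x in xs))
```

The check works because consecutive grid values differ by $\frac{\epsilon x}{4(1+\epsilon)} \le \epsilon/2$, the smallest value is at most 1, and the largest is at least $x \ge 1$.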

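12 Proof of Claim 2 in Lemma 7

Proof. First, for each $1 \le l \le j-1$, we have $|Opt_l \setminus (\cup_{w=1}^{j-1} B_{j,w})| \le |Opt_l \setminus B_{j,l}|$. Secondly, by Markov
inequality, we have

$$|Opt_l \setminus B_{j,l}| \le \frac{\delta_l^2}{(r_j - \|p_{v_l} - m_l\|)^2}|Opt_l|.$$

Note that $\delta_{opt}^2 = \sum_{j=1}^{k}\beta_j\delta_j^2$, and $\beta_l \ge \beta_j$ (by $l < j$). Thus, we have $\delta_l \le \sqrt{\frac{1}{\beta_l}}\,\delta_{opt} \le \sqrt{\frac{1}{\beta_j}}\,\delta_{opt}$.
Together with $j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} \le r_j$ and $\|p_{v_l} - m_l\| \le \epsilon\delta_l + (1+\epsilon)\,l\sqrt{\frac{\epsilon}{\beta_l}}\,\delta_{opt}$ (by induction hypothesis), we
have

$$\begin{aligned}
r_j - \|p_{v_l} - m_l\| &\ge j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} - \Big(\epsilon\delta_l + (1+\epsilon)(j-1)\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}\Big) \\
&= (1 - (j-1)\epsilon)\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt} - \epsilon\delta_l \\
&\ge (1 - (j-1)\epsilon - \sqrt{\epsilon})\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt},
\end{aligned}$$

where the last inequality follows from $\delta_l \le \sqrt{\frac{1}{\beta_l}}\,\delta_{opt} \le \sqrt{\frac{1}{\beta_j}}\,\delta_{opt}$. Thus, we have

$$\begin{aligned}
|Opt_l \setminus B_{j,l}| &\le \frac{\delta_l^2}{(1 - (j-1)\epsilon - \sqrt{\epsilon})^2\frac{\epsilon}{\beta_j}\delta_{opt}^2}|Opt_l| \\
&\le \frac{1}{(1 - (j-1)\epsilon - \sqrt{\epsilon})^2\frac{\epsilon}{\beta_j}\beta_l}|Opt_l| \\
&= \frac{\beta_j}{(1 - (j-1)\epsilon - \sqrt{\epsilon})^2\epsilon} \cdot \frac{|Opt_l|}{\beta_l} \\
&= \frac{\beta_j|\mathcal{G}|}{(1 - (j-1)\epsilon - \sqrt{\epsilon})^2\epsilon} \le \frac{\beta_j|\mathcal{G}|}{(1 - j\sqrt{\epsilon})^2\epsilon},
\end{aligned}$$

where the second inequality follows from the fact that $\beta_l\delta_l^2 \le \delta_{opt}^2$, and the fourth equation follows
from the fact that $\frac{|Opt_l|}{\beta_l} = |\mathcal{G}|$. Note that we can assume $\epsilon$ is small enough such that $\epsilon \le \frac{1}{4j^2}$, which implies
that $\frac{1}{(1 - j\sqrt{\epsilon})^2\epsilon} \le \frac{4}{\epsilon}$. Otherwise, we can just replace $\epsilon$ by $\frac{1}{4k^2}$ as the input at the beginning of the
algorithm. Thus, in total, we have

$$|Opt_l \setminus (\cup_{w=1}^{j-1} B_{j,w})| \le |Opt_l \setminus B_{j,l}| \le \frac{4\beta_j}{\epsilon}|\mathcal{G}|.$$

Thus the Claim is proved. □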



13 Proof of Claim 3 in Lemma [7] 



Proof. First, we have 



\Pi\ 



Pi UP- 



> 



-Mi - -1 

7-1 ^ 3 > 



J 



Ad 



J 3 



> 



1 + e 



3 ' 



Let a 2 denote the variance of Pi U P, . By Lemma 



since 



I a up, I 

| Opt, | - | Opt, | 



_Li > I-Pll > 1 



'Old 



we know that ||ti — t'\\ < \J fzp-a- Meanwhile, 

2 



ft 101 



4-, we have a 2 < , IPffp , <5 2 < J \ £ 2 . Then we have 

-1 ' — |FiUF,| 3 — 3 



Ti — T < 



2e 3 



ra < 



2e 3 /j-1 



< 



< 



(l-e3)(l 



i-^V 1 

5opt 



£»3 
J 



_ e 3)(l _ £ 3 )/3 . °opt - x _ e3 Oopt, 



where the third inequality follows from Sj < w -^-dopt- Thus, the claim is true. 



□ 



14 Chromatic k-Medians Clustering

In this section, we extend our ideas for k-CMeans to the Chromatic k-Medians Clustering problem
(k-CMedians). Similar to k-CMeans, we first show its relationship with k-medians (in Section 14.1)
and then present a $(5+\epsilon)$-approximation algorithm (in Section 14.2). Due to the lack of a similar
Simplex Lemma for k-CMedians, we achieve a constant approximation, instead of a PTAS.

Definition 3 (Chromatic k-Median Clustering (k-CMedians)). Let $\mathcal{G} = \{G_1, \dots, G_n\}$ be a set
of $n$ point-sets with each $G_i = \{p_1^i, \dots, p_{k_i}^i\}$ consisting of $k_i \le k$ points in $\mathbb{R}^d$ space. The chromatic k-median
clustering (or k-CMedians) of $\mathcal{G}$ is to find $k$ points $\{m_1, \dots, m_k\}$ in $\mathbb{R}^d$ space and a chromatic
partition $U_1, \dots, U_k$ of $\mathcal{G}$ such that $\frac{1}{n}\sum_{j=1}^{k}\sum_{q \in U_j}\|q - m_j\|$ is minimized.



14.1 Constant Approximation from k-Medians

Given a set of points in $\mathbb{R}^d$, the optimal median point is also called the Fermat-Weber point in geometry.
Its main difference from the mean point is that no explicit formula exists for computing the optimal
median point, while the mean point is simply the average of the given points. Consequently, the median
point is often approximated using some iterative procedure, such as Weiszfeld's algorithm. Thus in
the following discussion, we only assume the availability of a $(1+\epsilon)$-approximation of the median
point.
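As an illustration of such an iterative procedure, the following is a minimal Python sketch of the classical Weiszfeld iteration, which repeatedly replaces the current estimate by a distance-weighted average of the input points. The function names, the iteration cap, and the stopping tolerance are assumptions made for this example; the handling of an estimate that lands exactly on an input point (where the plain update is undefined) is a simplification that just stops early.

```python
import math


def weiszfeld(points, iterations=200, tol=1e-9):
    """Approximate the geometric median (Fermat-Weber point) of `points`."""
    d = len(points[0])
    # Start from the mean point, which is cheap to compute.
    y = [sum(p[i] for p in points) / len(points) for i in range(d)]
    for _ in range(iterations):
        numerator = [0.0] * d
        denominator = 0.0
        for p in points:
            dist = math.dist(p, y)
            if dist < tol:
                # The estimate coincides with an input point; the plain
                # update would divide by zero, so this sketch stops here.
                return y
            weight = 1.0 / dist
            denominator += weight
            for i in range(d):
                numerator[i] += weight * p[i]
        new_y = [v / denominator for v in numerator]
        if math.dist(new_y, y) < tol:
            return new_y
        y = new_y
    return y


def median_cost(points, center):
    """Sum of Euclidean distances from the points to `center`."""
    return sum(math.dist(p, center) for p in points)


if __name__ == "__main__":
    triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
    center = weiszfeld(triangle)
    print(center, median_cost(triangle, center))
```

Each update does not increase the sum of distances, so the returned point is never worse than the starting mean point.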
Lemma 9. Let $P$ be a set of points in $\mathbb{R}^d$ space, and $P_1$ be a subset of $P$ containing a fraction of $\alpha \le 1$
points of $P$. Let $m_{opt}$ and $m$ be the optimal and $(1+\epsilon)$-approximate median point of $P$ respectively,
and $m_1$ be the optimal median of $P_1$. Then $\|m_1 - m\| \le \frac{2+\epsilon}{\alpha}\mu$, where $\mu = \frac{1}{|P|}\sum_{p \in P}\|p - m_{opt}\|$.

Proof. Let $\mu_1 = \frac{1}{|P_1|}\sum_{p \in P_1}\|p - m_1\|$. Since $P_1 \subset P$, it is easy to know that $\sum_{p \in P}\|p - m\| \ge
\sum_{p \in P_1}\|p - m\|$, which implies that $(1+\epsilon)\mu \ge \frac{1}{|P|}\sum_{p \in P_1}\|p - m\| = \alpha\frac{1}{|P_1|}\sum_{p \in P_1}\|p - m\|$. By the triangle
inequality, we also have $\|p - m\| \ge \|m - m_1\| - \|p - m_1\|$. Thus,

$$(1+\epsilon)\mu \ge \alpha(\|m - m_1\| - \mu_1). \quad (12)$$

Since $m_1$ is the optimal median of $P_1$, we have $\mu_1 = \frac{1}{|P_1|}\sum_{p \in P_1}\|p - m_1\| \le \frac{1}{|P_1|}\sum_{p \in P_1}\|p - m_{opt}\| \le
\frac{1}{|P_1|}\sum_{p \in P}\|p - m_{opt}\| = \frac{1}{\alpha}\mu$. Plugging this into inequality (12), we have $\|m - m_1\| \le \frac{2+\epsilon}{\alpha}\mu$. □



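Theorem 6. Let $\mathcal{G} = \{G_1, \dots, G_n\}$ be an instance of k-CMedians, and $C$ be the $k$ $(1+\epsilon)$-approximate
median points of the $k$ clusters generated by a $c$-approximation k-medians algorithm on the points
$\cup_{i=1}^{n} G_i$. Then, $[C]^k$ contains at least one $k$-tuple whose elements are the $k$ median points of a $((2+\epsilon)ck^2 + (2+\epsilon)k + 1)$-approximation
of k-CMedians on $\mathcal{G}$, where $[C]^k = \underbrace{C \times \dots \times C}_{k}$.

Proof. Let $\{c_1, \dots, c_k\}$ be the set of $k$ approximate median points in $C$, and $\{S_1, \dots, S_k\}$ be the $k$
clusters returned by the $c$-approximation k-medians algorithm. Thus, $c_j$ is the $(1+\epsilon)$-approximate
median point of $S_j$ for $1 \le j \le k$. Let $\mathcal{OPT} = \{Opt_1, \dots, Opt_k\}$ be the unknown optimal solution
for k-CMedians on $\mathcal{G}$, and $m_j$ be the optimal median point of $Opt_j$ for $1 \le j \le k$. Denote the set
$Opt_i \cap S_j$ as $\Gamma_j^i$, and its optimal median point as $\tau_j^i$ for $1 \le i, j \le k$.

Since $\cup_{j=1}^{k}\Gamma_j^i = Opt_i$, there must exist some index $1 \le j_i \le k$ such that $|\Gamma_{j_i}^i| \ge \frac{1}{k}|Opt_i|$. Fixing $j_i$,
we have the following about $\sum_{p \in Opt_i}\|p - c_{j_i}\|$.

$$\begin{aligned}
\sum_{p \in Opt_i}\|p - c_{j_i}\| &= \sum_{p \in Opt_i}\|p - m_i + m_i - c_{j_i}\| \\
&\le \sum_{p \in Opt_i}(\|p - m_i\| + \|m_i - c_{j_i}\|) \\
&= \sum_{p \in Opt_i}\|p - m_i\| + |Opt_i| \times \|m_i - c_{j_i}\| \\
&= \sum_{p \in Opt_i}\|p - m_i\| + |Opt_i| \times \|m_i - \tau_{j_i}^i + \tau_{j_i}^i - c_{j_i}\| \\
&\le \sum_{p \in Opt_i}\|p - m_i\| + |Opt_i| \times (\|m_i - \tau_{j_i}^i\| + \|\tau_{j_i}^i - c_{j_i}\|).
\end{aligned}$$

By Lemma 9, we have

$$\|\tau_{j_i}^i - m_i\| \le 2\frac{|Opt_i|}{|\Gamma_{j_i}^i|}\Big(\frac{1}{|Opt_i|}\sum_{p \in Opt_i}\|p - m_i\|\Big);$$

$$\|\tau_{j_i}^i - c_{j_i}\| \le (2+\epsilon)\frac{|S_{j_i}|}{|\Gamma_{j_i}^i|}\Big(\frac{1}{|S_{j_i}|}\sum_{p \in S_{j_i}}\|p - c_{j_i}\|\Big).$$

From the above inequalities, we have

$$\begin{aligned}
\sum_{p \in Opt_i}\|p - c_{j_i}\| &\le \sum_{p \in Opt_i}\|p - m_i\| + |Opt_i| \times (\|m_i - \tau_{j_i}^i\| + \|\tau_{j_i}^i - c_{j_i}\|) \\
&\le \sum_{p \in Opt_i}\|p - m_i\| + 2\frac{|Opt_i|}{|\Gamma_{j_i}^i|}\sum_{p \in Opt_i}\|p - m_i\| + (2+\epsilon)\frac{|Opt_i|}{|\Gamma_{j_i}^i|}\sum_{p \in S_{j_i}}\|p - c_{j_i}\| \\
&\le \Big((2+\epsilon)\frac{|Opt_i|}{|\Gamma_{j_i}^i|} + 1\Big)\sum_{p \in Opt_i}\|p - m_i\| + (2+\epsilon)\frac{|Opt_i|}{|\Gamma_{j_i}^i|}\sum_{p \in S_{j_i}}\|p - c_{j_i}\|.
\end{aligned}$$

Since $|\Gamma_{j_i}^i| \ge \frac{1}{k}|Opt_i|$, we have $\frac{|Opt_i|}{|\Gamma_{j_i}^i|} \le k$. Thus,

$$\sum_{p \in Opt_i}\|p - c_{j_i}\| \le ((2+\epsilon)k + 1)\sum_{p \in Opt_i}\|p - m_i\| + (2+\epsilon)k\sum_{p \in S_{j_i}}\|p - c_{j_i}\|.$$

Summing both sides of the above inequality over $i$, we have

$$\begin{aligned}
\sum_{i=1}^{k}\sum_{p \in Opt_i}\|p - c_{j_i}\| &\le ((2+\epsilon)k + 1)\sum_{i=1}^{k}\sum_{p \in Opt_i}\|p - m_i\| + (2+\epsilon)k\sum_{i=1}^{k}\sum_{p \in S_{j_i}}\|p - c_{j_i}\| \\
&\le ((2+\epsilon)k + 1)\sum_{i=1}^{k}\sum_{p \in Opt_i}\|p - m_i\| + (2+\epsilon)k^2\sum_{j=1}^{k}\sum_{p \in S_j}\|p - c_j\|. \quad (13)
\end{aligned}$$

It is easy to know that the optimal objective value of k-medians is no larger than that of k-CMedians
on the same set of points in $\mathcal{G}$. Thus, $\sum_{j=1}^{k}\sum_{p \in S_j}\|p - c_j\| \le c\sum_{i=1}^{k}\sum_{p \in Opt_i}\|p - m_i\|$. Plugging this
inequality into (13), we have

$$\sum_{i=1}^{k}\sum_{p \in Opt_i}\|p - c_{j_i}\| \le ((2+\epsilon)ck^2 + (2+\epsilon)k + 1)\sum_{i=1}^{k}\sum_{p \in Opt_i}\|p - m_i\|.$$

The above inequality means that if we take the $k$-tuple $(c_{j_1}, \dots, c_{j_k})$ as the $k$ approximate median
points for k-CMedians, we have a $((2+\epsilon)ck^2 + (2+\epsilon)k + 1)$-approximation solution for k-CMedians.
Thus, the theorem is proved. □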

14.2 Peeling Algorithm for k-CMedians

The following lemma is a key to the peeling algorithm for k-CMedians (i.e., it plays a similar role as
Lemma 5 for k-CMeans).

Lemma 10. Let $P$ be a set of points in $\mathbb{R}^d$ with a partition $P = \cup_{l=1}^{j} P_l$, $o$ be its optimal median
point, and $o_l$ be the optimal median point of $P_l$ for $1 \le l \le j$. Let $\mu = \frac{1}{|P|}\sum_{p \in P}\|p - o\|$. Then, there
exists some $i_0$ such that $\|o - o_{i_0}\| \le 4\mu$.

Proof. Since $\mu = \frac{1}{|P|}\sum_{p \in P}\|p - o\| = \sum_{l=1}^{j}\frac{|P_l|}{|P|}\Big(\frac{1}{|P_l|}\sum_{p \in P_l}\|p - o\|\Big)$, there must exist some index $i_0$ such that
$\frac{1}{|P_{i_0}|}\sum_{p \in P_{i_0}}\|p - o\| \le \mu$. By Markov inequality, we know that there exists one subset $U$ of $P_{i_0}$ such that
$|U| > |P_{i_0}|/2$ and $\|p - o\| \le 2\mu$ for any $p \in U$.

Since $o_{i_0}$ is the optimal median point of $P_{i_0}$, $\frac{1}{|P_{i_0}|}\sum_{p \in P_{i_0}}\|p - o_{i_0}\| \le \frac{1}{|P_{i_0}|}\sum_{p \in P_{i_0}}\|p - o\| \le \mu$. Similarly, by
Markov inequality, we know that there exists one subset $V$ of $P_{i_0}$ such that $|V| > |P_{i_0}|/2$ and
$\|p - o_{i_0}\| \le 2\mu$ for any $p \in V$.

From the inequalities $|U| > |P_{i_0}|/2$ and $|V| > |P_{i_0}|/2$, which give $U \cap V \ne \emptyset$, we
know that there exists one point $p_0 \in U \cap V$ such that $\|p_0 - o\| \le 2\mu$ and $\|p_0 - o_{i_0}\| \le 2\mu$. Thus
$\|o_{i_0} - o\| \le \|o_{i_0} - p_0\| + \|p_0 - o\| \le 4\mu$. □
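Lemma 10 can be sanity-checked in one dimension, where a median of the input values minimizes the sum of absolute deviations and is easy to compute. The sketch below is only such a one-dimensional illustration (the lemma itself is stated for $\mathbb{R}^d$); the function name `lemma10_gap` and the random test data are hypothetical.

```python
import random
import statistics


def lemma10_gap(parts):
    """Return (distance from the overall median to the nearest part median, 4*mu).

    `parts` is a list of non-empty lists of numbers forming a partition of P.
    """
    all_points = [x for part in parts for x in part]
    o = statistics.median(all_points)           # an optimal 1-D median of P
    mu = sum(abs(x - o) for x in all_points) / len(all_points)
    nearest = min(abs(o - statistics.median(part)) for part in parts)
    return nearest, 4.0 * mu


if __name__ == "__main__":
    rng = random.Random(1)
    parts = [[rng.gauss(center, 1.0) for _ in range(rng.randint(1, 30))]
             for center in (-5.0, 0.0, 7.0)]
    nearest, bound = lemma10_gap(parts)
    print(nearest, bound, nearest <= bound)
```

For even-sized inputs `statistics.median` returns the midpoint of the two middle values, which is still a minimizer of the sum of absolute deviations, so the lemma's hypothesis is met.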

Before presenting our peeling algorithm, we still need the following result proved by Badoiu et
al. in [6] for finding an approximate solution for 1-median.

Theorem 7 ([6]). Let $P$ be a normalized set of $n$ points in $\mathbb{R}^d$ space, $1 > \epsilon > 0$, and $R$ be a random
sample of $O(1/\epsilon^3\log 1/\epsilon)$ points from $P$. Then one can compute, in $O(d2^{O(1/\epsilon^4)}\log n)$ time, a point-set
$S(P, R)$ of cardinality $O(2^{O(1/\epsilon^4)}\log n)$, such that with constant probability (over the choices of $R$),
there is a point $q \in S(P, R)$ such that $cost(q, P) \le (1+\epsilon)med_{opt}(P, 1)$.

Algorithm k-CMedians

Input: $\mathcal{G} = \{G_1, \dots, G_n\}$, $k \ge 2$ and a small constant $\epsilon > 0$.
Output: a $(5+\epsilon)$-approximation solution for k-CMedians on $\mathcal{G}$.

1. Run the $(1+\epsilon)$-approximation k-medians algorithm from [15] on $\mathcal{G}$, and let Q be the obtained
objective value.

2. For i = 1 to ^ do 

e 

(a) Set 5 = H + i-^Q, and run Algorithm Sphere-Peeling- Tree-2. 

(b) Let $\mathcal{T}_i$ be the returned tree.

3. For each path of every $\mathcal{T}_i$, use the bipartite matching procedure to compute the objective value of
k-CMedians on $\mathcal{G}$. Output the $k$ points from the path with the smallest objective value.
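Step 3 evaluates a candidate set of $k$ points by assigning, for every point-set $G_i$, its points to distinct candidate points at minimum total distance. The sketch below illustrates that evaluation in Python. It deliberately swaps the bipartite matching procedure for brute-force enumeration over injective assignments, which is only practical for small $k$; the function names are hypothetical.

```python
import math
from itertools import permutations


def best_assignment(group, centers):
    """Cheapest way to send the points of one point-set to distinct centers.

    Returns (cost, choice), where choice[t] is the index of the center that
    receives group[t]. Brute force stands in for bipartite matching here.
    """
    best_cost, best_choice = math.inf, None
    for choice in permutations(range(len(centers)), len(group)):
        cost = sum(math.dist(p, centers[c]) for p, c in zip(group, choice))
        if cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice


def chromatic_cost(groups, centers):
    """Total k-CMedians objective for fixed centers (unnormalized sum).

    Point-sets are independent once the centers are fixed, so each one is
    assigned separately.
    """
    return sum(best_assignment(group, centers)[0] for group in groups)


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0)]
    groups = [[(1.0, 0.0), (2.0, 0.0)], [(9.0, 0.0)]]
    print(chromatic_cost(groups, centers))
```

In the first point-set both points are closer to the first center, but the chromatic constraint forces one of them to the second center, which is why the cost exceeds the unconstrained k-medians cost.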

Algorithm Sphere-Peeling-Tree-2
Input: $\mathcal{G}$, $k \ge 2$, $\epsilon, \delta > 0$.

Output: A tree $\mathcal{T}$ of height $k$ with each node $v$ associated with a point $p_v \in \mathbb{R}^d$.

1. Initialize $\mathcal{T}$ with a single root node $v$ associated with no point.

2. Recursively grow each node v in the following way 

(a) If the height of v is already k, then it is a leaf node. 

(b) Otherwise, let j be the height of v. Build the set of radius candidates 1Z = y^ fcw ) j -l^A j2 t / 2 y / ei5 
< / < 4 + |}.For each radius candidate r S K do 

i. Let $j$ be the height of $v$, and $\{p_{v_1}, \dots, p_{v_j}\}$ be the $j$ points associated with nodes on the
root-to-$v$ path (including $p_v$).

ii. For each $p_{v_l}$, $1 \le l \le j$, construct a ball centered at $p_{v_l}$ and with radius $r$.

hi. Take a random sample from g\uj =1 Bjj with size m = ^ In \. Compute the approximate 
median points of all subsets of the sample (by Theorem 7), and denote the set of the
approximate median points as $\Pi$. Clearly, $|\Pi| = 2^{m + O(1/\epsilon^4)}\log n$.

iv. For each point $p$ in $\Pi$, add one child to $v$, and associate it with $p$; add another $j$ children,
with each one associated with a different point in $\{p_{v_1}, \dots, p_{v_j}\}$.



We can use a similar approach as in Section 5.3 to analyze the correctness of Algorithm k-CMedians.

Let $\mathcal{OPT} = \{Opt_1, \dots, Opt_k\}$ be the optimal solution of k-CMedians on $\mathcal{G}$. Without loss of
generality, we assume that $|Opt_1| \ge |Opt_2| \ge \dots \ge |Opt_k|$. For each $Opt_j$, $1 \le j \le k$, let $m_j$ be its
median point, $\beta_j$ be its fraction in $\mathcal{G}$ (i.e., $|Opt_j| / |\cup_{i=1}^{n} G_i|$), and $\mu_j = \frac{1}{|Opt_j|}\sum_{p \in Opt_j}\|p - m_j\|$. Thus,
$\beta_1 \ge \dots \ge \beta_k$ and $\sum_{j=1}^{k}\beta_j = 1$. Also, let $\mu_{opt} = \sum_{j=1}^{k}\beta_j\mu_j$.
Lemma 11. Among all the trees generated in Algorithm k-CMedians, there exists one tree $\mathcal{T}_i$, which
has a root-to-leaf path with each node $v_j$ at level $j$, $1 \le j \le k$, on the path associated with a point $p_{v_j}$
satisfying the inequality

$$\|p_{v_j} - m_j\| \le 4\mu_j + (1+\epsilon)\,j\frac{\epsilon}{\beta_j}\mu_{opt}.$$



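Lemma 12. If Lemma 11 is true, Algorithm k-CMedians yields a $(5 + O(k^2)\epsilon)$-approximation solution
for k-CMedians.

Proof. We first assume that Lemma 11 is true. Then for each $1 \le j \le k$, we have

$$\begin{aligned}
\sum_{p \in Opt_j}\|p - p_{v_j}\| &\le \sum_{p \in Opt_j}\|p - m_j\| + |Opt_j| \times \|m_j - p_{v_j}\| \\
&\le \sum_{p \in Opt_j}\|p - m_j\| + |Opt_j| \times \Big(4\mu_j + (1+\epsilon)\,j\frac{\epsilon}{\beta_j}\mu_{opt}\Big) \\
&= 5|Opt_j|\mu_j + (1+\epsilon)j\epsilon|\mathcal{G}|\mu_{opt}. \quad (14)
\end{aligned}$$

Summing both sides of (14) over $j$, we have

$$\begin{aligned}
\sum_{j=1}^{k}\sum_{p \in Opt_j}\|p - p_{v_j}\| &\le \sum_{j=1}^{k}\big(5|Opt_j|\mu_j + (1+\epsilon)j\epsilon|\mathcal{G}|\mu_{opt}\big) \\
&\le 5\sum_{j=1}^{k}|Opt_j|\mu_j + (1+\epsilon)k^2\epsilon|\mathcal{G}|\mu_{opt} \\
&= (5 + O(k^2)\epsilon)|\mathcal{G}|\mu_{opt}. \quad (15)
\end{aligned}$$

In the above, the last equation follows from the fact that $\sum_{j=1}^{k}|Opt_j|\mu_j = |\mathcal{G}|\mu_{opt}$. By (15), we
know that $\{p_{v_1}, \dots, p_{v_k}\}$ induces a $(5 + O(k^2)\epsilon)$-approximation solution for k-CMedians. □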



By a similar argument to that given in the proof of Lemma 7, we can show the correctness of Lemma 11.
Thus, we have the following theorem.

Theorem 8. With constant probability, Algorithm k-CMedians yields a $(5+\epsilon)$-approximation for
k-CMedians in $O(2^{poly(\frac{k}{\epsilon})}n(\log n)^{2k}d)$ time.